Chameleon: Mixed-Modal Early-Fusion Foundation Models

Curiosity: Will Chameleon be Meta's Llama 4? 🦎 🦙 Meta proposes "Chameleon: Mixed-Modal Early-Fusion Foundation Models", a unified approach with fully token-based representations of both image and text. No encoders or connectors. 👀

Architecture Overview

Retrieve: Chameleon uses a unified token-based approach for multimodal understanding and generation.

```mermaid
graph TB
    A[Input] --> B[Image Tokenizer]
    A --> C[Text Tokenizer]
    B --> D[1024 Image Tokens]
    C --> E[Text Tokens]
    D --> F[Unified Token Sequence]
    E --> F
    F --> G[Llama 2 Decoder]
    G --> H[Output: Text/Image]

    style A fill:#e1f5ff
    style F fill:#fff3cd
    style H fill:#d4edda
```
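As a rough sketch of the flow above: both modalities are mapped to discrete ids and concatenated into one sequence for a single decoder. The tokenizers below are hypothetical stand-ins (the real ones are learned models), so only the shapes, not the ids, are meaningful.

```python
# Sketch of Chameleon-style early fusion: both modalities become discrete
# tokens in ONE sequence, consumed by one decoder. No encoders, no connectors.
# Both tokenizers here are hypothetical placeholders, not Chameleon's.

def image_tokenizer(image):
    # Real version: a learned VQ tokenizer mapping a 512x512 image
    # to 1024 token ids drawn from a codebook of 8192.
    return [hash(("img", image, i)) % 8192 for i in range(1024)]

def text_tokenizer(text):
    # Real version: BPE over a 65,536-token vocabulary
    # (which includes the 8192 image codebook tokens).
    return [hash(("txt", tok)) % 65536 for tok in text.split()]

def build_sequence(segments):
    """Interleave text and image segments into one unified token sequence."""
    tokens = []
    for kind, payload in segments:
        if kind == "text":
            tokens.extend(text_tokenizer(payload))
        elif kind == "image":
            tokens.extend(image_tokenizer(payload))
    return tokens

seq = build_sequence([("text", "A photo of a chameleon"),
                      ("image", "chameleon.png"),
                      ("text", "on a branch")])
print(len(seq))  # 5 text + 1024 image + 3 text tokens -> 1032
```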

Implementation Details

| Step | Component | Details |
|------|-----------|---------|
| 1. Tokenizers | Image + Text | Image: 512×512 → 1024 tokens (codebook 8192)<br>Text: BPE vocab 65,536 (includes image tokens) |
| 2. Architecture | Llama 2 Decoder | Query-key normalization<br>Layer norm reordering<br>Stabilized mixed-modal training |
| 3. Pretraining Stage 1 | 80% of training | Text-only: 2.9T tokens<br>Text-image: 1.4B pairs/1.5T tokens<br>Interleaved: 400B tokens |
| 4. Pretraining Stage 2 | 20% of training | Higher-quality data<br>Instruction data<br>Half dataset size |
| 5. Fine-tuning | Final stage | ~1.8M samples<br>~100k vision samples |
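The query-key normalization mentioned above can be sketched in plain Python: apply a layer norm to queries and keys before the dot product, which bounds the attention logits. The single-head setup and the missing learned gain/bias are simplifications for illustration, not Chameleon's actual implementation.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean / unit variance (no learned scale here)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def qk_norm_attention(queries, keys, values):
    """Attention with layer norm applied to queries and keys first.

    Normalizing q and k bounds the logit magnitudes, which is the
    stabilizing effect relied on in mixed-modal training."""
    d = len(queries[0])
    out = []
    for q in queries:
        qn = layer_norm(q)
        logits = [sum(a * b for a, b in zip(qn, layer_norm(k))) / math.sqrt(d)
                  for k in keys]
        weights = softmax(logits)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Each output row is a convex combination of the value rows.
attn = qk_norm_attention([[1.0, 2.0, 3.0, 4.0]],
                         [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 2.0, 3.0]],
                         [[1.0, 0.0], [0.0, 1.0]])
```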

Training Data Breakdown

```mermaid
pie title Training Data Distribution
    "Text-only (2.9T)" : 60
    "Text-Image (1.5T)" : 31
    "Interleaved (400B)" : 9
```
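The chart's rounded shares follow directly from the stage-1 token counts (the interleaved slice is rounded up to 9 so the slices total 100):

```python
# Sanity check of the distribution: percentages from stage-1 token counts.
mix = {"Text-only": 2.9e12, "Text-Image": 1.5e12, "Interleaved": 0.4e12}
total = sum(mix.values())  # 4.8T tokens of stage-1 pretraining data
shares = {k: round(100 * v / total) for k, v in mix.items()}
print(shares)  # {'Text-only': 60, 'Text-Image': 31, 'Interleaved': 8}
```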

Key Insights

Retrieve: Chameleon's unified token-based approach enables native multimodal understanding and generation.

| Insight | Description | Impact |
|---------|-------------|--------|
| Unified Tokens | No encoders/connectors | ⬆️ Native multimodal generation |
| Training Scale | 9.2T tokens, 2.1 epochs | ⬆️ Strong performance |
| Code Data | Improved reasoning | ⬆️ Text-only tasks |
| Scaling Challenge | Difficult above 8B/1T | ⚠️ Training stability |
| High-Quality Data | Last 20% crucial | ⬆️ Significant boost |
| Performance | Outperforms competitors | ⬆️ Strong results |

Performance Comparison

Innovate: Chameleon-34B achieves competitive performance across benchmarks.

Text Tasks:

  • Outperforms Llama2-70B
  • Approaches Mixtral 8x7B/Gemini-Pro
  • Strong on GSM8K, MATH, MMLU

Vision Tasks:

  • Outperforms Flamingo-80B and IDEFICS-80B on MS-COCO
  • Matches performance on Flickr30k

Multimodal Evaluation:

  • 60.4% win rate vs. Gemini-Pro
  • 51.6% win rate vs. GPT-4V

Comparison with Previous MLLMs

| Model | Architecture | Multimodal Generation |
|-------|--------------|-----------------------|
| Idefics, GPT-4v, Flamingo | Encoders + Connectors | ❌ Limited |
| Chameleon | Unified Tokens | ✅ Native support |

Key Advantage: Chameleon can generate both text and images using discrete tokens, enabling true multimodal document generation.
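One way to see what native support buys: a single generated token stream can be split back into text spans and fixed-size image blocks purely by vocabulary range. The id range below is an assumed placement for illustration, not Chameleon's documented vocab layout.

```python
# Hypothetical vocab layout: this sketch assumes the 8192 image codebook ids
# occupy one dedicated range of the 65,536-token vocabulary, and that each
# image is a fixed block of 1024 consecutive image tokens.
IMG_LO, IMG_HI = 57344, 65536   # assumed placement of the image codebook
IMG_BLOCK = 1024                # tokens per image

def split_modalities(tokens):
    """Split one generated token stream into ('text', [...]) / ('image', [...]) segments."""
    segments, i = [], 0
    while i < len(tokens):
        if IMG_LO <= tokens[i] < IMG_HI:
            # An image token starts a fixed-size image block.
            segments.append(("image", tokens[i:i + IMG_BLOCK]))
            i += IMG_BLOCK
        else:
            # Collect a run of text tokens until the next image block.
            j = i
            while j < len(tokens) and not (IMG_LO <= tokens[j] < IMG_HI):
                j += 1
            segments.append(("text", tokens[i:j]))
            i = j
    return segments

stream = [5, 17, 80] + [IMG_LO + k % 8192 for k in range(1024)] + [9, 2]
print([(kind, len(toks)) for kind, toks in split_modalities(stream)])
# [('text', 3), ('image', 1024), ('text', 2)]
```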

Key Takeaways

Retrieve: Chameleon demonstrates that unified token-based representations can achieve strong multimodal performance without separate encoders or connectors.

Innovate: By using discrete tokens for both images and text, Chameleon enables native multimodal understanding and generation, approaching GPT-4o's capabilities with a simpler architecture.

Curiosity → Retrieve → Innovate: Start with curiosity about unified multimodal models, retrieve insights from Chameleon's token-based approach, and innovate by building applications that leverage native multimodal generation.

Next Steps:

  • Read the full paper
  • Explore Chameleon architecture
  • Compare with GPT-4o
  • Build multimodal applications

Paper: https://huggingface.co/papers/2405.09818

Note: With its native multimodal tokens, Chameleon looks to be closer to OpenAI GPT-4o than Uni-MoE (shared yesterday). 💡

Translate to Korean

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Will Chameleon be Meta's Llama 4? 🦎 🦙 Meta proposes "Chameleon: Mixed-Modal Early-Fusion Foundation Models", a unified approach with fully token-based representations of both image and text. No encoders or connectors. 👀

Implementation:

  • 1๏ธโƒฃ ํ›ˆ๋ จ๋œ 2๊ฐœ์˜ ํ† ํฌ๋‚˜์ด์ €, 512 ร— 512 ์ด๋ฏธ์ง€๋ฅผ ์ฝ”๋“œ๋ถ(8192)์—์„œ 1024๊ฐœ์˜ ํ† ํฐ์œผ๋กœ ์ธ์ฝ”๋”ฉํ•˜๋Š” ์ด๋ฏธ์ง€ ํ† ํฌ๋‚˜์ด์ €์™€ 8192 ์ด๋ฏธ์ง€ ์ฝ”๋“œ๋ถ ํ† ํฐ์„ ํฌํ•จํ•˜๋Š” 65,536์˜ ์–ดํœ˜๋ฅผ ๊ฐ€์ง„ BPE.
  • 2๏ธโƒฃ๋Š” Llama 2๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” ๋””์ฝ”๋” ์•„ํ‚คํ…์ฒ˜๋ฅผ ์‚ฌ์šฉํ•˜์ง€๋งŒ ์ฟผ๋ฆฌ ํ‚ค ์ •๊ทœํ™” ๋ฐ ๋ ˆ์ด์–ด ๊ทœ๋ฒ”์˜ ์žฌ์ •๋ ฌ์„ ํ†ตํ•ฉํ•˜์—ฌ ํ˜ผํ•ฉ ๋ชจ๋‹ฌ ์„ค์ •์—์„œ ํ›ˆ๋ จ์„ ์•ˆ์ •ํ™”ํ•ฉ๋‹ˆ๋‹ค.
  • 3๏ธโƒฃ ํ…์ŠคํŠธ ์ „์šฉ(Llama 2, CodeLlama โ‡’ 2.9T ํ† ํฐ), ํ…์ŠคํŠธ ์ด๋ฏธ์ง€(1.4B ์Œ/1.5T ํ† ํฐ), ํ…์ŠคํŠธ/์ด๋ฏธ์ง€ ์ธํ„ฐ๋ฆฌ๋ธŒ(400B ํ† ํฐ)์— ๋Œ€ํ•œ ์‚ฌ์ „ ํ•™์Šต 1๋‹จ๊ณ„(80%);
  • 4๏ธโƒฃ ์‚ฌ์ „ ํ•™์Šต 2๋‹จ๊ณ„ (20%) ์ฒซ ๋ฒˆ์งธ ๋‹จ๊ณ„์˜ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์ด๊ณ  ๋” ๋†’์€ ํ’ˆ์งˆ์˜ ๋ฐ์ดํ„ฐ์™€ ์ง€์นจ ๋ฐ์ดํ„ฐ๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.
  • 5๏ธโƒฃ ~100k ๋น„์ „ ์ƒ˜ํ”Œ๋กœ ~180๋งŒ ๊ฐœ์˜ ์ƒ˜ํ”Œ์— ๋ฏธ์„ธ ์กฐ์ •.

Insights:

  • 🔗 Previous MLLMs (Idefics, GPT-4v, Flamingo) used encoders and connectors for multimodality, which limited their ability to generate multimodal documents (image + text output).
  • 🦎 Chameleon can both understand and generate text and images using discrete tokens.
  • 📚 Chameleon-34B was trained for 2.1 epochs over its full training dataset, 9.2T tokens in total.
  • 🔧 Code data improved performance on text-only reasoning tasks.
  • ⚖️ Keeping training stable was difficult when scaling beyond 8B parameters and 1T tokens.
  • 🚀 The final 20% of pretraining on high-quality data significantly boosted performance.
  • 🏆 Chameleon-34B outperforms Llama2-70B and approaches Mixtral 8x7B/Gemini-Pro on GSM8K, MATH, and MMLU.
  • 📊 Chameleon-34B outperforms Flamingo-80B and IDEFICS-80B on MS-COCO and matches them on Flickr30k.
  • 🎯 Chameleon-34B achieves a 60.4% win rate against Gemini-Pro and a 51.6% win rate against GPT-4V.
  • ⚖️ Balanced-modality datasets are important for fine-tuning and alignment.

Paper: https://huggingface.co/papers/2405.09818

Note: With its native multimodal tokens, Chameleon looks to be closer to OpenAI GPT-4o than Uni-MoE (shared yesterday). 💡

This post is licensed under CC BY 4.0 by the author.