Release Chameleon Model

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Will Chameleon be Meta's Llama 4? 🦎 🦙 Meta proposes “Chameleon: Mixed-Modal Early-Fusion Foundation Models”, a unified approach that represents both images and text as fully token-based sequences. No separate encoders or connectors. 👀

Implementation:

  • 1๏ธโƒฃ Trained 2 tokenizers, an image Tokenizer that encodes a 512 ร— 512 image into 1024 tokens from a codebook (8192) and a BPE with a vocab of 65,536, which includes the 8192 image codebook token.
  • 2๏ธโƒฃ uses a Decoder architecture based on Llama 2 but incorporates query-key normalization and reordering of layer norms to stabilize training in the mixed-modal setting.
  • 3๏ธโƒฃ Pretraining stage 1 (80%) unsupervised training on text-only (Llama 2, CodeLlama โ‡’ 2.9T tokens), text-image (1.4B pairs/1.5T tokens), Text/Image Interleaved (400B tokens);
  • 4๏ธโƒฃ Pretraining stage 2 (20%) Halved the dataset of first stage and include higher quality data and instruction data.
  • 5๏ธโƒฃ Fine-tuned on ~1.8 million samples with ~100k vision samples.

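For item 2️⃣, here is a rough PyTorch sketch of the two stability tweaks: query-key normalization and a layer-norm reordering that normalizes the sub-layer output inside the residual. The module names and exact norm placement are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    """Causal self-attention with LayerNorm applied per head to queries and keys."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.LayerNorm(self.head_dim)   # query-key normalization
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self.q_norm(q.view(b, t, self.n_heads, self.head_dim))
        k = self.k_norm(k.view(b, t, self.n_heads, self.head_dim))
        v = v.view(b, t, self.n_heads, self.head_dim)
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))  # (b, heads, t, head_dim)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, d))


class ReorderedBlock(nn.Module):
    """Residual block with the norm applied to the sub-layer output (one reading
    of the 'reordered layer norms' used to stabilize the larger model)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn = QKNormAttention(dim, n_heads)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.norm1(self.attn(x))  # norm inside the residual, after attention
        return x + self.norm2(self.ffn(x))
```
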
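The two-stage data recipe in 3️⃣ and 4️⃣ can also be restated as a config-style dictionary. The numbers are taken from the bullets above; the structure and field names are an assumed illustration, not an actual training config.

```python
# Assumed config-style restatement of the two-stage pre-training mixture.
PRETRAIN_MIX = {
    "stage1": {                      # ~80% of the pre-training budget
        "text_only": {"tokens": 2.9e12, "sources": ["Llama 2 data", "CodeLlama data"]},
        "text_image_pairs": {"pairs": 1.4e9, "tokens": 1.5e12},
        "interleaved_text_image": {"tokens": 4.0e11},
    },
    "stage2": {                      # ~20% of the budget
        "stage1_data_weight": 0.5,   # stage-1 data down-weighted by 50%
        "added": ["higher-quality data", "instruction data"],
    },
}
```
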
Insights:

  • ๐Ÿ”— Previous MLLM (Idefics, GPT-4v, Flamingo) used encoders and connectors for multimodality, which limited their ability to generate multimodal documents (image + text outputs).
  • ๐ŸฆŽ Chameleon can understand and generate both text and images using discrete tokens
  • ๐Ÿ“š Chameleon-34B trained for 2.1 epochs over our full training dataset for a total of 9.2T tokens.
  • ๐Ÿ”ง Code Data improved text-only reasoning tasks performance.
  • โš–๏ธ Challenging to maintain stable training when scaling the Chameleon models above 8B parameters and 1T tokens.
  • ๐Ÿš€ The last 20% of pre-training with high-quality data significantly boosted performance.
  • ๐Ÿ† Chameleon-34B outperforms Llama2-70B and approaches Mixtral 8x7B/Gemini-Pro, GSM8K, MATH, and MMLU.
  • ๐Ÿ“Š Chameleon-34B outperforms Flamingo-80B and IDEFICS-80B on MS-COCO and matches on Flickr30k.
  • ๐ŸŽฏ Chameleon-34B achieves 60.4% win rate against Gemini-Pro and a 51.6% against GPT-4V.
  • โš–๏ธ Balanced modality datasets are important for Fine-tuning and Alignment.

Paper: https://huggingface.co/papers/2405.09818

Note: With its native multimodal tokens, Chameleon looks closer to OpenAI's GPT-4o than to Uni-MoE (shared yesterday). 💡

This post is licensed under CC BY 4.0 by the author.