๐— ๐—ฎ๐—ฟ๐—ธ ๐—ญ๐˜‚๐—ฐ๐—ธ๐—ฒ๐—ฟ๐—ฏ๐—ฒ๐—ฟ๐—ด ๐—ท๐˜‚๐˜€๐˜ ๐—ฎ๐—ป๐—ป๐—ผ๐˜‚๐—ป๐—ฐ๐—ฒ๐—ฑ ๐˜๐—ต๐—ฒ ๐—š๐—ฃ๐—ง-๐Ÿฐ ๐—ธ๐—ถ๐—น๐—น๐—ฒ๐—ฟ, ๐—Ÿ๐—น๐—ฎ๐—บ๐—ฎ-๐Ÿฏ.๐Ÿญ ๐Ÿ’ฅ

Overview of Llama 3.1

Meta's Llama-3.1 upgrades the 8B and 70B Llama-3 models, already top performers in their weight class, to make them even better, and delivers the strongest open-source model ever with the 405B.

Two main points:

🫅 **The new king of OS models: Llama-3.1-405B**, on par with or above GPT-4o on many benchmarks.

If confirmed by further testing, this would officially be the first time an open-source model becomes the strongest model overall, ahead of all models from Anthropic and OpenAI!

Let me repeat this: **the strongest LLM ever can be downloaded from the Hub**.
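And "downloaded from the Hub" really is a few lines of code. Here is a minimal sketch with transformers, assuming you have accepted the license on the model page and logged in with `huggingface-cli login` (the repo id below was the official one at release; check the Hub if it has moved):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # swap in the 70B/405B if you have the hardware

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # keep the checkpoint's dtype (bfloat16)
    device_map="auto",   # shard across available GPUs (requires accelerate)
)

inputs = tokenizer("Open-source LLMs are great because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```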

📚 **The 8B and 70B models are extended to a 128k-token context length.**

The previous models were limited to 8k tokens, meaning they could process at most about as much text as 15 pages of a Word doc: this was a severe blocker whenever you needed some memory, as in RAG or agent workflows.

Well, not anymore! ✅ Now we get a much more comfortable 128k context length for all sizes, which is great for most of my agentic use cases.
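To put numbers on that, here is the back-of-envelope behind the "15 pages" figure, using my own rough constants (about 0.75 English words per token and 400 words per Word-doc page):

```python
WORDS_PER_TOKEN = 0.75  # rough average for English text (assumption)
WORDS_PER_PAGE = 400    # typical Word-doc page (assumption)

for context_tokens in (8_000, 128_000):
    pages = context_tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE
    print(f"{context_tokens:>7,} tokens ≈ {pages:,.0f} pages")

# 8,000 tokens  ≈  15 pages
# 128,000 tokens ≈ 240 pages
```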

Each of the points above is huge and would be newsworthy on its own; dropping both together in a "3.1" version is crazy! 🤯

๐—ง๐—ฒ๐—ฐ๐—ต๐—ป๐—ถ๐—ฐ๐—ฎ๐—น ๐—ถ๐—ป๐˜€๐—ถ๐—ด๐—ต๐˜๐˜€:

🫅 **A new 405B, possibly the strongest LLM ever**, with 128k context length, 88.6% on MMLU, and a crazy 96.8% on GSM8K.

  • ➤ The 405B has an FP8-quantized version. FP8 quantization was only applied to the major linear operators of the model, such as the gate, up, and down projections of the FFNs (covering 75% of the inference FLOPs); see the serving sketch after this list.
  • ➤ You still need 8×H100 GPUs to run it with the full context length.
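As a concrete illustration, here is a minimal serving sketch using vLLM (my choice of stack, not something the release mandates). The FP8 repo id is the one published at release, worth double-checking on the Hub, and the layout matches the 8×H100 node above:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",  # FP8 repo id at release; verify on the Hub
    tensor_parallel_size=8,  # shard across the 8 H100s of one node
    max_model_len=128_000,   # full context; lower it if the KV cache doesn't fit
)

outputs = llm.generate(
    ["Explain FP8 quantization in one paragraph."],
    SamplingParams(temperature=0.6, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```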

🦣 Improved 8B & 70B models, with a **much larger context size of 128k vs 8k ⇒ this is a game-changer for RAG and Agents**.

  • 📚 Pretrained on 15T tokens, with a more diverse training dataset than Llama-3 to reinforce multilinguality: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
  • 🔓 License: same as Llama-3, and on top of that it allows using the output data from Llama-3.1 to train other models (distillation).
  • ✨ One new role for the instruct version: on top of System, User, and Assistant, Ipython lets you feed the output of a code tool call back into the conversation (see the sketch below)! This should work really well with Transformers agents 🎉
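Here is a minimal sketch of the new role, assuming the chat template shipped with the instruct checkpoints; the tool call and its output below are made up for illustration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant with a Python interpreter."},
    {"role": "user", "content": "What is 2**32?"},
    {"role": "assistant", "content": "print(2**32)"},  # the model's code tool call (illustrative)
    {"role": "ipython", "content": "4294967296"},      # the tool's output, fed back to the model
]

# The template wraps the ipython message in its own header, so the model can
# tell tool results apart from user text when writing its final answer.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```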

Thanks a lot to Meta for this release, which will make our lives better! 🤗
