
Introducing Llama 3.1 - the most capable LLMs from Meta, free and arguably open-source!

 Llama 3.1 MMLU performance

Introducing Llama 3.1

Yesterday's Llama 3.1 release marked a big milestone for LLM researchers and practitioners. Llama 3.1 405B is the biggest and most capable openly available LLM to date. And particularly exciting is that this release comes with a 93-page research paper. Below, I share a few interesting facts from the paper; I will likely write a longer analysis this weekend.

Meta announcement 👉 https://ai.meta.com/blog/meta-llama-3-1/

True to our commitment to open source, starting today, we're making these models available to the community for download on llama.meta.com and Hugging Face and available for immediate development on our broad ecosystem of partner platforms.

Introducing Llama 3.1 - details

Model sizes

Llama 3.1 now comes in three sizes: 8B, 70B, and 405B parameters. The 8B and 70B variants are slight upgrades over the Llama 3 models released in April 2024. (See the figure below for a brief performance comparison.)

The 405B model was used to improve the 8B and 70B via synthetic data during the finetuning stages.
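
To make that distillation idea concrete, here is a minimal sketch of synthetic data generation with a large teacher model; the Hub model id, prompt, and pipeline settings are illustrative assumptions on my part, not Meta's actual setup.

```python
# Minimal sketch of synthetic-data distillation: a large "teacher" model answers
# prompts, and the answers become fine-tuning targets for a smaller model.
# Model id and prompts are illustrative placeholders, not Meta's actual pipeline.
from transformers import pipeline

teacher = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",  # assumed Hub id; any strong teacher works
    device_map="auto",
)

prompts = ["Explain the difference between BF16 and FP8 quantization."]

synthetic_pairs = []
for prompt in prompts:
    completion = teacher(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
    synthetic_pairs.append({"instruction": prompt, "response": completion})

# synthetic_pairs can now be used as SFT data for the 8B or 70B model.
```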

Pretraining Data

The 93-page report by Meta (a link to the report is in the comments below) offers amazing detail. In particular, the section on preparing the 15.6 trillion pretraining tokens is detailed enough that it would be possible to reproduce the dataset preparation.

However, Meta doesn't share the dataset sources. All we know is that it's trained primarily on "web data." This is probably because of the usual copyright concerns and to prevent lawsuits.

Still, it's a great writeup if you plan to prepare your own pretraining datasets, as it shares recipes on deduplication, formatting (removal of markdown markers), quality filters, removal of unsafe content, and more.
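
As a toy illustration of what such a pipeline involves, here is a minimal sketch of exact deduplication plus a crude length-based quality filter; the hashing scheme and threshold are my own illustrative choices and far simpler than the heuristics in the report.

```python
# Toy sketch of two steps the report describes in far more depth:
# exact deduplication via document hashes and a crude quality filter.
# The thresholds and heuristics here are illustrative, not the paper's.
import hashlib

def dedup_and_filter(documents, min_words=50):
    seen_hashes = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        seen_hashes.add(digest)
        if len(doc.split()) < min_words:
            continue  # too short to be useful pretraining text
        kept.append(doc)
    return kept

print(dedup_and_filter(["hello world " * 30, "hello world " * 30, "short"]))
```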

Long-context Support

The models support a context size of up to 128k tokens. The researchers achieved this via a multi-stage process: they first pretrained the models with an 8k-token context window (due to resource constraints) and then continued pretraining with progressively longer windows up to 128k tokens.

During this continued pretraining, they increased the context length in six stages. They also observed that roughly 0.1% of the finetuning instruction samples need to be long-context samples; otherwise, the long-context capabilities decline.
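
Below is a rough sketch of what such a staged schedule could look like. The paper specifies six stages from 8k to 128k tokens, but the intermediate lengths in this snippet are assumptions for illustration.

```python
# Illustrative sketch of a staged context-length extension schedule.
# Six stages from 8k to 128k tokens as described in the paper; the
# intermediate lengths below are assumptions, not Meta's exact values.
stages = [8_192, 16_384, 32_768, 49_152, 65_536, 98_304, 131_072]  # hypothetical ladder

for start, target in zip(stages, stages[1:]):
    print(f"continued pretraining: extend context {start} -> {target} tokens")
    # ... run continued pretraining at the new sequence length and check
    # long-context benchmarks before moving on to the next stage ...
```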

Alignment

In contrast to earlier rumors, Llama 3.1 was not finetuned using both RLHF with proximal policy optimization (PPO) and direct preference optimization (DPO). Following a supervised instruction finetuning (SFT) stage, the models were trained only with DPO, not PPO. (Unlike in the Llama 2 paper, the researchers unfortunately didn't include a chart analyzing the improvements made via this process.)

Although they didnโ€™t use PPO, they used a reward model for rejection sampling during the instruction finetuning stage.
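
Here is a minimal sketch of how rejection sampling with a reward model typically works; `generate_candidates` and `reward_model` are hypothetical placeholders, and this is not Meta's actual implementation.

```python
# Minimal sketch of rejection sampling with a reward model: sample several
# candidate answers, score them, and keep only the highest-scoring one as
# SFT data. `generate_candidates` and `reward_model` are placeholders.
def rejection_sample(prompt, generate_candidates, reward_model, n=8):
    candidates = generate_candidates(prompt, n=n)            # n sampled completions
    scores = [reward_model(prompt, c) for c in candidates]   # scalar reward per completion
    best = max(zip(scores, candidates))[1]                   # keep the top-scoring answer
    return {"prompt": prompt, "response": best}
```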

Inference

The 405B model required 16k H100 GPUs for training.

During inference, the bfloat16 version of the model still requires 16 H100 GPUs. However, Meta also provides an FP8 version that runs on a single server node (that is, 8 H100s).
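
Those GPU counts follow from simple weight-memory arithmetic (ignoring KV cache and activation memory); this back-of-the-envelope sketch just restates that estimate.

```python
# Back-of-the-envelope weight memory for the 405B model (weights only,
# ignoring KV cache and activations), matching the GPU counts above.
params = 405e9
bf16_gb = params * 2 / 1e9   # ~810 GB -> needs ~16 x 80 GB H100s with headroom
fp8_gb = params * 1 / 1e9    # ~405 GB -> fits on a single 8 x 80 GB H100 node
print(f"BF16 weights: {bf16_gb:.0f} GB, FP8 weights: {fp8_gb:.0f} GB")
```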

Performance

You are probably curious about how it compares to other models.

The short answer is "very favorable," on par with GPT-4. Unfortunately, I exceeded the character limit for this LinkedIn post, so I will let the figure below speak for itself.

Summary

It took months to train on 16,000 Nvidia H100 GPUs, resulting in a 405B-parameter model with a 128K-token context length that, according to the benchmarks, is mostly superior to OpenAI's GPT-4.

Benchmarks can be biased, and more parameters do not guarantee better performance. The only way to find out how good it really is will be real feedback from users over time.

The most exciting thing about Llama 3.1 is that it is almost open-source, although there are some restrictions.

โžก๏ธLetโ€™s see whatโ€™s open-source and whatโ€™s not:

  • Commercial use is allowed, unless your app has over 700 million monthly active users, in which case you'll need to obtain a license from Meta.

  • While the training data for Llama 3.1 is not open, the model code is publicly available. It consists of approximately 300 lines of Python and PyTorch, along with the FairScale library for distributed GPU training.

  • Another cool part is that the model weights are open. This helps developers build AI-powered apps: instead of paying to use the GPT-4 API, you can now self-host your own model and pay a cloud provider a bunch of money to rent some GPUs.
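
For illustration, here is a minimal self-hosting sketch using Hugging Face transformers; the Hub model id and generation settings are assumptions on my part, and you still need access approval plus appropriate GPUs.

```python
# Minimal self-hosting sketch with Hugging Face transformers. The model id is
# the 8B instruct checkpoint Meta published on the Hub (assumed here); swap in
# the 70B or 405B if you actually rented those GPUs.
import torch
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Give me three facts about llamas."}]
result = chat(messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```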


 Llama 3.1 Ultra-Efficiently

🦥 Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth AI

A new comprehensive guide about supervised fine-tuning, published on Hugging Face.

Over the last year, I've done a lot of fine-tuning and blogging. This guide brings it all together. Here are the main takeaways:

  • How to efficiently fine-tune a Llama 3.1 model in Google Colab
  • When you should use fine-tuning and how it works
  • How to tune the hyperparameters, process datasets, etc.
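
As a taste of what such a fine-tune looks like, here is a rough QLoRA sketch in the spirit of the guide (not its exact recipe); the Unsloth checkpoint name, dataset file, and hyperparameters are illustrative assumptions.

```python
# Rough sketch of a QLoRA fine-tune with Unsloth + TRL. Checkpoint name,
# dataset file, and hyperparameters are illustrative, not the guide's recipe.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # assumed 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Hypothetical JSONL file with a "text" field holding formatted chat samples.
dataset = load_dataset("json", data_files="my_sft_data.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="llama31-lora",
    ),
)
trainer.train()
```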

 Llama 3.1 Technical Report

Llama 3.1 technical report - a treasure trove on LLM building

I really recommend keeping the Llama 3.1 technical report, published last week by Meta, as a reference for whenever you need to build a SOTA LLM from scratch. It's rare to see so much information in a technical report these days! Here are some of my takeaways.

Scaling laws

🧐 A question I had for a long time: are scaling laws reliable? That is, can you really predict how model performance will grow as you increase the compute spent?

โžก๏ธ The researchers have confirmed that you can derive scaling laws from smaller models: more precisely, the loss function of your best model will decrease linearly with the log of the compute that you spend to train it. For a clearer view, look at the figure below. This result has already been shown before (cf. Chinchilla paper) so no groundbreaking news here.

But computing these scaling laws is costly, so the experiments generally don't go very far up in compute, only up to about 10^22 FLOPs (floating-point operations). The question thus persists: are these scaling laws reliable at higher compute budgets?

💡 What the report shows is that the scaling laws computed up to 10^22 FLOPs actually do hold at much higher compute ✅ you can keep drawing a straight line on the graph up to over 10^25 FLOPs, more than three orders of magnitude higher, and accurately predict the loss of the huge Llama-3.1-405B! 🤯

This suggests one could keep drawing the line even further and get an idea of the performance of multi-trillion parameter models!
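
To make the extrapolation argument concrete, here is a toy sketch that fits a straight line in log-compute space and reads off a predicted loss at roughly Llama-3.1-405B scale; the (compute, loss) points are made up for illustration, not taken from the paper.

```python
# Toy illustration of the extrapolation argument: fit a straight line in
# log-compute space on small-scale runs, then read off the predicted loss at
# roughly Llama-3.1-405B scale (~3.8e25 FLOPs). The data points are made up.
import numpy as np

compute = np.array([1e19, 1e20, 1e21, 1e22])   # FLOPs of hypothetical small runs
loss = np.array([2.9, 2.6, 2.3, 2.0])          # illustrative losses, not real data

slope, intercept = np.polyfit(np.log10(compute), loss, deg=1)
predicted_405b_loss = slope * np.log10(3.8e25) + intercept
print(f"extrapolated loss at ~3.8e25 FLOPs: {predicted_405b_loss:.2f}")
```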

Tool use (agentic) training

Tool-use training is still very new. Existing fine-tuning procedures are often limited to a single tool call (so no multi-step calls) and to a specific syntax. Here, the team went further:

  • ➤ They used synthetically generated tool-calling datasets covering all three patterns: single tool calls, nested calls (one tool call requiring the output of another), and parallel tool calls.
  • ➤ They created both a single-call and a multi-step tool-call dataset. Both also exist in a preference version, where an annotator picked the best answer, in order to perform DPO.

🛠️ They specifically trained their models for three tools: a Python code interpreter, Brave Search, and the Wolfram Alpha API. This is meaningful IMO, since these tools are a strong basis for many agentic problems. (A sketch of what such training samples might look like is below.)
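
Here is a hypothetical sketch of the three tool-call patterns as training samples; the schema and tool names are my own illustration, since the paper does not publish the exact training format.

```python
# Hypothetical examples of the three tool-call patterns mentioned above
# (single, nested, parallel). The schema is made up for illustration; the
# paper does not publish the exact training format.
single_call = {
    "user": "What is 17 * 23?",
    "calls": [{"tool": "python", "code": "print(17 * 23)"}],
}

nested_calls = {
    "user": "Plot the population of the largest EU country.",
    "calls": [
        {"tool": "brave_search", "query": "largest EU country by population"},
        {"tool": "python", "code": "plot(population_series)"},  # uses the search result
    ],
}

parallel_calls = {
    "user": "Compare the weather in Paris and Rome today.",
    "calls": [
        {"tool": "brave_search", "query": "weather Paris today"},
        {"tool": "brave_search", "query": "weather Rome today"},
    ],
}
```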

Other insights

  • 💥 They used 16,000 H100s 🥲
  • 🛠️ They parallelized training in 4D: tensor, pipeline, and context parallelism, plus FSDP (a rough sanity check of how these multiply out is sketched below).
  • ⚙️ For post-training, they did not use any complicated RLHF pipeline like GPT-4's, but simply several rounds of supervised fine-tuning (SFT) + DPO.
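
As a rough sanity check, the four parallelism dimensions multiply out to roughly the GPU count mentioned above; the per-dimension sizes below are illustrative, not the exact configuration from the paper.

```python
# Rough sanity check of how four parallelism dimensions multiply out to the
# training GPU count. Per-dimension sizes are illustrative assumptions.
tensor_parallel = 8      # within a node
context_parallel = 1     # >1 only for the long-context stages
pipeline_parallel = 16   # layers split across nodes
data_parallel = 128      # FSDP replicas

total_gpus = tensor_parallel * context_parallel * pipeline_parallel * data_parallel
print(total_gpus)  # 16384, i.e. on the order of the 16k H100s used for training
```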
This post is licensed under CC BY 4.0 by the author.