Meta has just released Llama 3.1, a significant advancement in open-source AI. The release includes a 405-billion-parameter model, the most sophisticated open model to date, outperforming GPT-4 on several benchmarks.
It comes in three sizes: 8B, 70B, and 405B (each with base and instruct versions).
All are natively multilingual and have official tool-calling support. The 405B model was used to improve the 8B and 70B via distillation and synthetic data during the finetuning stages. Multimodality is still a work in progress.
Performance highlights of Llama 3.1
- 405B: MMLU-Chat (general) 88.6, GSM8K (math) 96.8, HumanEval (code) 89. These are on par with GPT-4o.
- Specifically, the 405B beats GPT-4o on ARC Challenge (reasoning), GSM8K (math), Nexus (tool use), ZeroSCROLLS/QuALITY (long context), and the multilingual MGSM benchmark.
- 70B: MMLU-Chat 86, GSM8K 95.1, HumanEval 80
- 8B: MMLU-Chat 73, GSM8K 84.5, HumanEval 72.6, a substantial improvement over Llama 3 8B
License of Llama 3.1
Permissively licensed, including commercial use (unless you exceed 700 million monthly active users), synthetic data generation, distillation, and finetuning.
Architecture and Training details
All three models were trained on 15T tokens, aided by a synthetic data pipeline, and use a standard dense Transformer architecture. Techniques deserving special mention:
- Grouped-query attention (GQA) with 8 key-value heads
- A vocabulary of 128K tokens
- RoPE base frequency hyperparameter increased to 500,000 (see the sketch after this list)
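To make the RoPE change concrete, here is a minimal sketch of rotary position embeddings with the base frequency exposed as a hyperparameter. The function names and pairing convention are illustrative assumptions, not Meta's actual code; the point is that raising the base from the conventional 10,000 to 500,000 lowers the per-dimension rotation frequencies, so relative positions stay distinguishable over much longer contexts.

```python
import torch

def rope_frequencies(head_dim: int, max_seq_len: int, base: float = 500_000.0):
    """Precompute RoPE rotation angles for each position and dimension pair.

    A larger `base` (Llama 3.1 uses 500,000 vs. the original 10,000)
    yields lower rotation frequencies, which helps attention over
    long contexts.
    """
    # One inverse frequency per pair of dimensions in the attention head.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, inv_freq)  # (max_seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """Rotate query/key vectors pairwise by the precomputed angles.

    x: (batch, seq_len, num_heads, head_dim), pairing even/odd dims.
    """
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos = cos[None, :, None, :]  # broadcast over batch and heads
    sin = sin[None, :, None, :]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)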
The models can handle up to 128K tokens of context. This was achieved through a multi-stage process: initial pretraining on 8K-token windows due to compute limits, followed by continued pretraining that gradually increased the context length to 128K tokens over six stages, as sketched below.
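As a rough illustration of what such a staged schedule could look like (the intermediate lengths below are hypothetical placeholders; only the 8K start and 128K end points come from the release notes, and the exact per-stage values are in the paper):

```python
# Illustrative staged context-extension schedule. The intermediate
# lengths are hypothetical placeholders, not Meta's actual values.
stages = [8_192, 16_384, 32_768, 65_536, 98_304, 131_072]

for i, context_length in enumerate(stages, start=1):
    print(f"Stage {i}: continued pretraining at {context_length} tokens")
```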
Llama 3.1's finetuning process involved supervised finetuning (SFT) followed by direct preference optimization (DPO). Unlike some models, it did not use reinforcement learning from human feedback (RLHF) with proximal policy optimization (PPO). A minimal sketch of the DPO objective follows.
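This is the published DPO formulation written in generic PyTorch, not Meta's training code; the tensor names and the `beta` value are illustrative assumptions.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Direct preference optimization loss over a batch of preference pairs.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or a frozen reference model. DPO widens
    the margin between chosen and rejected responses relative to the
    reference, so no separate reward model or PPO loop is needed.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # -log(sigmoid(beta * margin)); minimized when chosen >> rejected.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```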
Meta has also published an exhaustive 92-page paper for Llama 3.1 covering details of pretraining data, filtering, annealing, synthetic data, scaling laws, infrastructure, parallelism, training recipes, post-training adaptation, benchmarking, inference strategies, quantization, and more.
What it is good at
With a 128K context window, Llama 3.1 is well suited to RAG applications. The main strength of the 405B model is as a teacher: it is ideal for generating synthetic data and distilling smaller, task-specific expert models. From synthetic data generation to model distillation, Llama 3.1 opens up a wide range of possibilities.
Model distillation transfers knowledge from a large teacher LLM to a smaller student model, aiming to maintain performance while reducing computational requirements.
The process typically involves training the student model to mimic the output distribution of the teacher model, often using softmax with temperature scaling to emphasize informative soft targets.
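A minimal sketch of that soft-target objective in generic PyTorch (the classic Hinton-style formulation; the temperature value and names are illustrative assumptions, not a specific Llama recipe):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Soft-target knowledge-distillation loss.

    Dividing logits by a temperature > 1 before softmax flattens the
    teacher's distribution, exposing the relative probabilities of
    non-top tokens that the student should learn to mimic.
    """
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence from teacher to student; the t**2 factor keeps
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t**2
```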
What is still missing
The currently released version of Llama 3.1 is not yet multimodal. Meta is working on integrating image, video, and speech capabilities, but those models are still under development and have not been broadly released.
Model pricing
Among the API providers, OctoAI offers Llama 3.1 405B at $3/M input tokens and $9/M output tokens, compared to GPT-4o's $5/M and $15/M respectively.
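To see what that difference means in practice, here is a quick back-of-the-envelope comparison; the monthly token volumes are hypothetical, and the prices are the ones quoted above.

```python
def monthly_cost(input_m_tokens, output_m_tokens, in_price, out_price):
    """Cost in dollars given token volumes (millions) and $/M-token prices."""
    return input_m_tokens * in_price + output_m_tokens * out_price

# Hypothetical workload: 10M input and 2M output tokens per month.
llama_405b = monthly_cost(10, 2, in_price=3, out_price=9)    # $48
gpt_4o = monthly_cost(10, 2, in_price=5, out_price=15)       # $80
print(f"Llama 3.1 405B: ${llama_405b}, GPT-4o: ${gpt_4o}")
```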
Access