Llama 4

Overview

Llama 4 is a multimodal, long-context model family released by Meta in April 2025 (reported release date: April 5, 2025). It processes text and images jointly and produces text and code outputs, aiming to preserve cross-modal capabilities while improving reasoning, coding, and conversational quality. The family emphasizes sparse, long-context processing through a Mixture-of-Experts backbone and an architecture designed for length generalization and inference efficiency.

The development program follows a staged pipeline: pretraining, mid-training for context extension, and post-training, combined with curriculum strategies for mixing modalities and online alignment methods. A training-stabilization technique referred to as MetaP is reported to assist hyperparameter selection.

Architecture and design

The architecture is built on a Transformer with a Mixture-of-Experts (MoE) backbone and several multimodal and long-context design choices. The backbone interleaves dense and MoE layers to improve inference efficiency while maintaining capacity, and it uses an early fusion approach to integrate vision and language natively. Attention-layer design mixes layers without positional embeddings and RoPE-based layers, and additional mechanisms such as iRoPE and inference-time attention scaling are reported.
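The mixing of position-free and RoPE attention layers, plus inference-time attention scaling, can be sketched as follows. The layer counts, interleaving ratio, and scaling rule below are illustrative assumptions, not the released configuration; the report does not spell out the exact schedule here.

```python
import math

# Hypothetical iRoPE-style layer schedule: most attention layers use RoPE,
# while every Nth layer omits positional embeddings entirely.
# All constants below are illustrative assumptions.
NUM_LAYERS = 48
NOPE_EVERY = 4  # assumption: one position-free layer per four layers

def layer_uses_rope(layer_idx: int) -> bool:
    """RoPE layers carry positional signal; NoPE layers are position-free."""
    return (layer_idx + 1) % NOPE_EVERY != 0

def attention_temperature(seq_len: int, train_len: int = 256_000) -> float:
    """Inference-time attention scaling: adjust attention logits once the
    context grows past the training length. Log-scaling is a common choice;
    the exact rule used for Llama 4 is not restated here."""
    if seq_len <= train_len:
        return 1.0
    return 1.0 + 0.1 * math.log(seq_len / train_len)

schedule = ["RoPE" if layer_uses_rope(i) else "NoPE" for i in range(NUM_LAYERS)]
```

Under this sketch, short contexts behave exactly as during training (temperature 1.0), and only beyond the training length does the scaling kick in.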

Key design and research contributions include:

  • Introduction of the Llama 4 model family with Scout and Maverick variants.
  • First Llama generation to use a mixture-of-experts backbone with alternating dense and MoE layers.
  • Early-fusion vision-language processing enabling native multimodality.
  • Long-context architecture choices to support extreme context windows and length generalization.
  • A training program comprising pretraining, mid-training for context extension, and post-training stages.
  • A curriculum strategy and continuous online RL procedures (including multimodal online RL biased toward harder prompts) to mix modalities without sacrificing single-modality performance.
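The MoE backbone described above can be sketched as a feed-forward layer with an always-on shared expert plus top-1 routed experts. The dimensions, router, and expert count below are illustrative (Scout reports 16 experts), not the released sizes or the exact routing scheme.

```python
import numpy as np

# Minimal MoE feed-forward sketch: shared expert + top-1 routed expert.
# Dimensions and routing details are illustrative assumptions.
rng = np.random.default_rng(0)
D, N_EXPERTS = 64, 16  # hidden size (toy), routed experts (Scout reports 16)

W_router = rng.normal(size=(D, N_EXPERTS)) * 0.02
shared_expert = rng.normal(size=(D, D)) * 0.02
routed_experts = rng.normal(size=(N_EXPERTS, D, D)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Each token flows through the shared expert and exactly one routed
    expert chosen by the router, so active parameters per token stay far
    below total parameters."""
    logits = x @ W_router                      # (tokens, experts)
    top1 = logits.argmax(axis=-1)              # chosen expert per token
    gate = np.exp(logits - logits.max(-1, keepdims=True))
    gate = gate / gate.sum(-1, keepdims=True)  # softmax gate weights
    out = x @ shared_expert                    # shared path, always active
    for t, e in enumerate(top1):
        out[t] += gate[t, e] * (x[t] @ routed_experts[e])
    return out

tokens = rng.normal(size=(8, D))
y = moe_layer(tokens)
```

This is why the "17B active parameters" figure can be a small fraction of the total: only the shared path and one expert per token contribute to each forward pass.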

Variants and capabilities

The family includes multiple variants and operating points. Two flagship variants with specific reported specs:

  • Llama 4 Scout: 17B active parameters, 109B total parameters, 16 experts; a reported maximum context of 10M tokens. Language support is reported both as a 12-language subset (Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, Vietnamese) and as a broader claim of "200 languages, including more than 100 languages with at least 1B tokens each." Scout is offered in base and instruction-tuned forms.

  • Llama 4 Maverick: 17B active parameters, 400B total parameters, 128 experts; a reported maximum context of 1M tokens. Language reporting mirrors Scout: the same 12-language subset and the broader "200 languages" claim. Maverick is offered in base and instruction-tuned forms and is positioned to emphasize performance-per-cost.

Multiple naming variants appear in reporting (e.g., "L4 Maverick", "L4 Scout", "L3.3 70B", "L3.1 405B") where instruction-tuned and base variants are noted; the primary family labels in evaluation reporting are Scout and Maverick.
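The reported parameter counts imply the following back-of-envelope figures, assuming bf16 storage (2 bytes per parameter); quantized deployments would be smaller. The "over 200 GB" Scout weight figure reported under limitations is consistent with this arithmetic.

```python
# Back-of-envelope weight memory and active-parameter fractions.
# Assumption: bf16 weights at 2 bytes per parameter.
BYTES_PER_PARAM = 2

def weight_gb(total_params: float) -> float:
    return total_params * BYTES_PER_PARAM / 1e9

scout_gb = weight_gb(109e9)     # ~218 GB, consistent with "over 200 GB"
maverick_gb = weight_gb(400e9)  # ~800 GB

# Active fraction per token (17B active of the reported totals):
scout_active = 17e9 / 109e9     # ~0.16
maverick_active = 17e9 / 400e9  # ~0.04
```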

Training and fine-tuning approach

Pretraining token totals are reported as approximately 40T for Scout, approximately 22T for Maverick, and more than 30T for Behemoth. Pretraining data is described as a "Mixture of publicly available data, licensed data, and Meta's products." Language coverage is reported as 200 total languages.

Important hyperparameters and tuning practices called out include per-layer learning rates and initialization scales. Compute for the two main variants is reported as "7.38M H100-80GB GPU-hours across Scout and Maverick."
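The reported token and compute totals imply a rough average training throughput. This is a back-of-envelope average over the two main variants only, ignoring mid-training, post-training, and other overheads.

```python
# Implied average throughput from reported totals:
# ~40T (Scout) + ~22T (Maverick) pretraining tokens
# over 7.38M H100-80GB GPU-hours.
total_tokens = 40e12 + 22e12                        # ~62T tokens
gpu_hours = 7.38e6
tokens_per_gpu_hour = total_tokens / gpu_hours      # ~8.4M tokens/GPU-hour
tokens_per_gpu_second = tokens_per_gpu_hour / 3600  # ~2.3K tokens/GPU-second
```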

Fine-tuning and alignment steps are lightweight and targeted:

  • Post-training supervised fine-tuning is described as "Lightweight SFT" focused on a harder subset of data.
  • Preference alignment is reported as "lightweight direct preference optimization."
  • Continuous online RL is used with alternating model updates and ongoing prompt filtering; a multimodal online RL stage is biased toward harder prompts.
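The preference-alignment step above is described only as "lightweight direct preference optimization"; the textbook DPO loss is sketched below for reference, not as Meta's exact recipe. Inputs are the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math

# Standard DPO loss sketch (not the report's exact recipe):
# loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Penalize the policy unless it prefers the chosen response more
    strongly (relative to the reference model) than the rejected one."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With no preference shift over the reference, the loss equals log(2);
# a positive margin drives it lower.
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```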

Instruction-tuned releases are available for both Scout and Maverick; instruction tuning is reported to yield large gains on multimodal evaluations.

Evaluation and benchmark highlights

Headline positioning contrasts the two flagships: Scout is described as "best in its class" with its 10M-token context, while Maverick is positioned for stronger performance-per-cost and leads Scout on several reasoning and coding benchmarks.

Selected benchmark results and reported metrics include (variants listed as reported):

  • LMArena (ELO): "1417 for Maverick".
  • MMLU (macro avg/acc char): "L3.1 70B: 79.3", "L3.1 405B: 85.2", "L4 Scout: 79.6", "L4 Maverick: 85.5".
  • MMLU-Pro (macro avg/em): "L3.1 70B: 53.8", "L3.1 405B: 61.6", "L4 Scout: 58.2", "L4 Maverick: 62.9".
  • MATH (em maj1@1): "L3.1 70B: 41.6", "L3.1 405B: 53.5", "L4 Scout: 50.3", "L4 Maverick: 61.2".
  • MBPP (pass@1): "L3.1 70B: 66.4", "L3.1 405B: 74.4", "L4 Scout: 67.8", "L4 Maverick: 77.6".
  • TyDiQA (average/f1): "L3.1 70B: 29.9", "L3.1 405B: 34.3", "L4 Scout: 31.5", "L4 Maverick: 31.7".
  • ChartQA (relaxed accuracy): multiple reported points including "L4 Scout: 83.4" and "L4 Maverick: 85.3"; elsewhere reported as "L4 Scout: 88.8" and "L4 Maverick: 90.0". Reporting notes that earlier L3 models had "No multimodal support for L3.1 70B and L3.1 405B".
  • DocVQA (ANLS): "L4 Scout: 89.4", "L4 Maverick: 91.6"; test results also reported as "L4 Scout: 94.4" and "L4 Maverick: 94.4".
  • MMMU (accuracy): "L4 Scout: 69.4", "L4 Maverick: 73.4" with L3 variants listed as "No multimodal support".
  • MMMU Pro (accuracy): "L4 Scout: 52.2", "L4 Maverick: 59.6".
  • MathVista (accuracy): "L4 Scout: 70.7", "L4 Maverick: 73.7".
  • LiveCodeBench (pass@1, 10/01/2024-02/01/2025): "L3.3 70B: 33.3", "L3.1 405B: 27.7", "L4 Scout: 32.8", "L4 Maverick: 43.4".
  • MMLU Pro (macro avg/acc): "L3.3 70B: 68.9", "L3.1 405B: 73.4", "L4 Scout: 74.3", "L4 Maverick: 80.5".
  • GPQA Diamond (accuracy): "L3.3 70B: 50.5", "L3.1 405B: 49.0", "L4 Scout: 57.2", "L4 Maverick: 69.8".
  • MGSM (average/em): "L3.3 70B: 91.1", "L3.1 405B: 91.6", "L4 Scout: 90.6", "L4 Maverick: 92.3".
  • MTOB (half book) eng → kgv / kgv → eng (chrF): context window is "128K"; "L4 Scout: 42.2/36.6", "L4 Maverick: 54.0/46.4".
  • MTOB (full book) eng → kgv / kgv → eng (chrF): context window is "128K"; "L4 Scout: 39.7/36.3", "L4 Maverick: 50.8/46.7".
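The coding rows above report pass@1. For reference, the standard unbiased pass@k estimator from the HumanEval/Codex evaluation methodology is sketched below; the report's exact sampling setup is not restated here.

```python
from math import comb

# Unbiased pass@k estimator: given n samples with c correct,
# pass@k = 1 - C(n-c, k) / C(n, k).
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # fewer failures than k: every k-subset has a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 4 correct, pass@1 reduces to the raw rate 4/10:
p1 = pass_at_k(10, 4, 1)  # 0.4
```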

Where it wins: reporting emphasizes that Maverick leads Scout on reasoning and coding benchmarks and that instruction tuning yields large gains on multimodal evaluations.

Limitations and deployment considerations

Several practical limitations and caveats are reported:

  • Serving the models can require loading the full model weights; for Scout-class weights this is reported as "over 200 GB".
  • Multi-image handling is implementation- and endpoint-dependent.
  • Real-world performance depends on prompt formats, inference kernels, and serving constraints.
  • The effective context window and available memory budget on a given platform are limiting factors for long-context applications.

Early fusion multimodality and very large context windows bring implementation and deployment trade-offs that impact latency, memory, and system design.
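The memory pressure from very large context windows comes mostly from the KV cache, which grows linearly with sequence length. The model dimensions below are hypothetical placeholders (the released configs are not restated here), but the scaling behavior is the point.

```python
# Rough KV-cache sizing: bytes = 2 (K and V) x layers x kv_heads
# x head_dim x seq_len x bytes/element.
# All model dimensions below are hypothetical assumptions (bf16, GQA-style
# KV heads), not the released Llama 4 configuration.
def kv_cache_gb(seq_len, layers=48, kv_heads=8, head_dim=128,
                bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

short_ctx = kv_cache_gb(128_000)    # ~25 GB at these assumed dimensions
long_ctx = kv_cache_gb(10_000_000)  # ~2 TB: why 10M-token contexts strain memory
```

Even with aggressive KV-head sharing or quantization, a 10M-token cache is orders of magnitude larger than a 128K one, which is why the effective context on a given platform is bounded by memory budget rather than by the model's nominal maximum.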

Sources

https://arxiv.org/abs/2601.11659v1