Gemma 3 — Model and Fine-tuning Overview
Overview and Positioning
Gemma 3 is a family of multimodal language models developed by Google DeepMind and the Gemma Team with a focus on long-context processing, multilingual capabilities, and visual understanding. Key capabilities and problems addressed include:
- Multimodal understanding
- Long context processing
- Multilingual capabilities
- Image safety classification
The work responds to shortcomings in prior multimodal and long-context models: memory growth of KV caches with long contexts, artifacts when processing non-square and high-resolution images, and evolving safety risks introduced by larger, more capable open models. Major contributions include support for 128K context tokens, an adaptive windowing algorithm for image processing, improved multilingual data mixtures, integration of image-to-text capabilities, decontamination techniques, and enhanced internal safety processes including development of ShieldGemma 2.
Architecture and Design Choices
Gemma 3 combines language and vision components. The language component is a decoder-only transformer, and the vision component is a frozen Vision Transformer encoder (SigLIP-based) shared across model sizes. All models in the family are dense (no Mixture-of-Experts reported).
Notable architectural and design choices:
- Interleaving of local and global attention layers, with the local:global ratio increased to 5:1 (five local layers for every global layer), up from 1:1 in Gemma 2.
- Local attention layers use a short sliding-window span limited to 1024 tokens.
- Grouped-Query Attention (GQA), which shares key/value heads across query heads to shrink the KV cache and improve inference efficiency.
- Increased RoPE base frequency from 10k to 1M for global self-attention layers in Gemma 3.
- Vision encoder is frozen, uses a fixed input resolution of 896 x 896, and uses average pooling to reduce visual outputs to 256 tokens before fusion.
- These choices emphasize compatibility with standard hardware while supporting both long contexts and high-resolution visual tasks.
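As a sketch of the interleaving described above, the following builds the causal attention mask for one layer under a 5:1 local:global pattern with a sliding window. The placement of global layers (every sixth layer) and the mask-based formulation are illustrative assumptions, not Gemma 3's actual implementation:

```python
import numpy as np

def attention_mask(seq_len: int, layer_idx: int,
                   local_window: int = 1024, ratio: int = 5) -> np.ndarray:
    """Causal attention mask for one layer of a local/global interleaving.

    Most layers use sliding-window (local) attention limited to
    `local_window` tokens; every (ratio + 1)-th layer is global. The
    exact placement of global layers here is an illustrative assumption.
    """
    q = np.arange(seq_len)[:, None]   # query positions (column vector)
    k = np.arange(seq_len)[None, :]   # key positions (row vector)
    causal = k <= q                   # no attending to future tokens
    if (layer_idx + 1) % (ratio + 1) == 0:
        return causal                            # global layer: full causal
    return causal & (q - k < local_window)       # local layer: windowed

# Toy example: 4 tokens, window of 2
m_local = attention_mask(4, layer_idx=0, local_window=2)
m_global = attention_mask(4, layer_idx=5, local_window=2)
```

In the toy example, the last query can see the first token only in the global layer; the local layer restricts it to the preceding two positions.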
Model Variants and Context Capacity
Gemma 3 is provided in multiple sizes, each with pretrained and instruction-tuned (IT) checkpoints. Reported parameter sizes are 1B, 4B, 12B, and 27B. Naming varies across reports (e.g., Gemma3-27B-IT vs. Gemma-3-27B-IT; a Gemma3-4B-IT variant also appears in variant lists).
Context length capabilities:
- Certain Gemma 3 variants (reportedly the 4B, 12B, and 27B models) support up to 128K context tokens.
- The 1B variant supports 32K context tokens (also written as 32768 in variant summaries).
Model naming and variant metadata are heterogeneous across reports; the core family centers on the 1B/4B/12B/27B sizes and an instruction-tuned 27B variant.
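The context-capacity figures above connect directly to the attention layout: at 128K context, only the global layers must cache every position, while local layers cap at the 1024-token window. A rough back-of-the-envelope sketch (the 48-layer count is an illustrative assumption, and per-token KV size is factored out):

```python
def kv_cache_positions(context: int, n_layers: int = 48,
                       local_window: int = 1024, ratio: int = 5) -> int:
    """Total KV positions cached across all layers (per-token KV size
    factored out). With a 5:1 local:global interleaving only 1 layer in
    6 caches the full context; local layers cap at `local_window`.
    The 48-layer count is an illustrative assumption."""
    n_global = n_layers // (ratio + 1)
    n_local = n_layers - n_global
    return n_global * context + n_local * min(context, local_window)

full = 48 * 131072                       # all-global baseline at 128K context
mixed = kv_cache_positions(131072)       # 5:1 interleaved layout
```

Under these assumptions the interleaved layout caches well under a fifth of the positions an all-global stack would, which is the memory-growth problem the design targets.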
Tokenization and Input Formatting
Tokenization and prompt formatting follow established practices:
- Tokenizer: SentencePiece, the same tokenizer used by Gemini 2.0.
- Vocabulary size is reported as both 256k and 262,000; these likely describe the same vocabulary (262,144 entries, i.e., 256 x 1024).
- Chat/prompt templates delimit turns with <start_of_turn> and <end_of_turn> markers together with the role labels "user" and "model".
- System prompt note: explicitly add the [BOS] token after tokenization, or use the add_bos=True option in the tokenizer.
Pretraining, Fine-tuning, and Alignment
Pretraining:
- Total pretraining token budgets are reported per model size: 2T tokens (1B), 4T (4B), 12T (12B), and 14T (27B).
- Data strategy includes "Revisiting data mixture to improve multilingual capabilities."
- Training objectives highlighted include knowledge distillation.
- Optimizer and distribution notes: ZeRO-3 was used.
- Compute: training was performed on TPUv4, TPUv5e, and TPUv5p infrastructure. Pre-training used 32K-token sequences, with context later scaled to 128K tokens.
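Knowledge distillation, highlighted above as a training objective, trains the student to match a teacher's per-token distribution. A minimal sketch of the objective as a per-position KL divergence over toy logits; Gemma 3's actual recipe (e.g., any teacher-logit sampling or renormalization) is not specified here:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)    # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits: np.ndarray, teacher_logits: np.ndarray) -> float:
    """Mean per-position KL(teacher || student) over the vocabulary."""
    p = softmax(teacher_logits)                  # teacher distribution
    log_q = np.log(softmax(student_logits))      # student log-probs
    kl = np.sum(p * (np.log(p) - log_q), axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 8))            # 4 positions, toy vocab of 8
loss_self = distill_loss(t, t)         # identical distributions -> ~0
loss_diff = distill_loss(2 * t, t)     # mismatched student -> positive KL
```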
Post-training and alignment:
- Supervised fine-tuning (SFT) / instruction tuning: reported as used ("yes").
- SFT objectives emphasized improving mathematics, reasoning, and chat abilities.
- Preference alignment reportedly combines knowledge distillation (from a larger instruction-tuned teacher) with reinforcement learning, including RLHF.
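Instruction tuning typically computes the loss only over the model's response tokens, not the prompt, so the model is not trained to reproduce user input. A generic sketch of that masking (a common SFT practice, not a documented Gemma 3 detail):

```python
import numpy as np

def sft_loss(token_nll: np.ndarray, loss_mask: np.ndarray) -> float:
    """Masked SFT objective: average next-token negative log-likelihood
    over response tokens only. A generic instruction-tuning sketch, not
    Gemma 3's exact pipeline."""
    masked = token_nll * loss_mask
    return float(masked.sum() / loss_mask.sum())

# Toy sequence: 3 prompt tokens (mask 0) followed by 2 response tokens (mask 1)
nll = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
mask = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
loss = sft_loss(nll, mask)   # averages only the two response tokens
```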
Evaluation and Benchmark Results
Headline evaluation claims:
- Gemma3-4B-IT reported as competitive with Gemma2-27B-IT.
- Gemma3-27B-IT reported as comparable to Gemini-1.5-Pro.
- Gemma-3-27B-IT ranked 9th with an Elo score of 1338.
- Low violation rate on safety policies reported for Gemma 3.
Selected benchmark results and notes (results reported as provided; comparators listed where present):
- LMSys Chatbot Arena — Metric: Elo score (preliminary results received on March 8, 2025). Result summary: Gemma-3-27B-IT scores 1338, outperforming models like DeepSeek-V3 (1318) and LLaMA 3 405B (1257). Comparators: DeepSeek-V3, LLaMA 3 405B, Qwen2.5-70B.
- Standard zero-/few-shot benchmarks — Metric: Score. The eleven values per benchmark correspond, in order, to Gemini 1.5 (Flash, Pro), Gemini 2.0 (Flash, Pro), Gemma 2 (2B, 9B, 27B), and Gemma 3 (1B, 4B, 12B, 27B); this column assignment is inferred from the comparator lists and the dash pattern on MMMU (dashes mark text-only models not evaluated on the multimodal benchmark):

| Benchmark | Gemini 1.5 Flash | Gemini 1.5 Pro | Gemini 2.0 Flash | Gemini 2.0 Pro | Gemma 2 2B | Gemma 2 9B | Gemma 2 27B | Gemma 3 1B | Gemma 3 4B | Gemma 3 12B | Gemma 3 27B |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MMLU-Pro | 67.3 | 75.8 | 77.6 | 79.1 | 15.6 | 46.8 | 56.9 | 14.7 | 43.6 | 60.6 | 67.5 |
| LiveCodeBench | 30.7 | 34.2 | 34.5 | 36.0 | 1.2 | 10.8 | 20.4 | 1.9 | 12.6 | 24.6 | 29.7 |
| Bird-SQL (dev) | 45.6 | 54.4 | 58.7 | 59.3 | 12.2 | 33.8 | 46.7 | 6.4 | 36.3 | 47.9 | 54.4 |
| GPQA Diamond | 51.0 | 59.1 | 60.1 | 64.7 | 24.7 | 28.8 | 34.3 | 19.2 | 30.8 | 40.9 | 42.4 |
| SimpleQA | 8.6 | 24.9 | 29.9 | 44.3 | 2.8 | 5.3 | 9.2 | 2.2 | 4.0 | 6.3 | 10.0 |
| FACTS Grounding | 82.9 | 80.0 | 84.6 | 82.8 | 43.8 | 62.0 | 62.4 | 36.4 | 70.1 | 75.8 | 74.9 |
| Global MMLU-Lite | 73.7 | 80.8 | 83.4 | 86.5 | 41.9 | 64.8 | 68.6 | 34.2 | 54.5 | 69.5 | 75.1 |
| MATH | 77.9 | 86.5 | 90.9 | 91.8 | 27.2 | 49.4 | 55.6 | 48.0 | 75.6 | 83.8 | 89.0 |
| HiddenMath | 47.2 | 52.0 | 63.5 | 65.2 | 1.8 | 10.4 | 14.8 | 15.8 | 43.0 | 54.5 | 60.3 |
| MMMU (val) | 62.3 | 65.9 | 71.7 | 72.7 | - | - | - | - | 48.8 | 59.6 | 64.9 |
- DocVQA (resolution sweep) — Metric: accuracy. Result summary: 31.9 at 256 resolution, 45.4 at 448 resolution, 59.8 at 896 resolution.
- InfoVQA (resolution sweep) — Metric: accuracy. Result summary: 23.1 at 256 resolution, 31.6 at 448 resolution, 33.7 at 896 resolution.
- TextVQA (resolution sweep) — Metric: accuracy. Result summary: 44.1 at 256 resolution, 53.5 at 448 resolution, 58.0 at 896 resolution.
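The resolution sweep is consistent with the vision encoder producing more patches at higher input resolution before pooling down to a fixed token budget. Assuming a SigLIP-style patch size of 14 (an assumption; the exact patch size is not given in this summary), an 896x896 input yields 4096 patches, which average pooling reduces 16x to the 256 tokens noted earlier:

```python
def vision_tokens(resolution: int, patch: int = 14, out_tokens: int = 256):
    """Patch count at a given input resolution, and the pooling factor
    needed to reach a fixed budget of `out_tokens`. The patch size of 14
    is an assumed SigLIP-style value."""
    side = resolution // patch           # patches per image side
    n_patches = side * side
    return n_patches, n_patches // out_tokens

patches_896, pool_factor = vision_tokens(896)   # 64 x 64 patches
patches_448, _ = vision_tokens(448)             # 32 x 32 patches
```

More patches at 896 means finer visual detail reaches the pooling stage, which plausibly explains the accuracy gains on text-dense benchmarks like DocVQA.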
Document and multimodal results by model size, with and without pan & scan (P&S):
- DocVQA — Metric: accuracy. Result summary: 72.8 for 4B, 81.0 for 4B w/ P&S, 85.6 for 27B, 90.4 for 27B w/ P&S.
- InfoVQA — Metric: accuracy. Result summary: 44.1 for 4B, 57.0 for 4B w/ P&S, 59.4 for 27B, 76.4 for 27B w/ P&S.
- TextVQA — Metric: accuracy. Result summary: 58.9 for 4B, 60.8 for 4B w/ P&S, 68.6 for 27B, 70.2 for 27B w/ P&S.
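The P&S gains above come from pan & scan: cropping a non-square or high-resolution image into windows that are each resized to the encoder's native 896x896 input. A simplified sketch using a fixed ceil-division grid (the actual algorithm chooses the number of crops adaptively from the aspect ratio):

```python
import math

def pan_and_scan_crops(width: int, height: int, crop: int = 896):
    """Split an image into a grid of crop boxes (left, top, right, bottom),
    each intended to be resized to the encoder's native 896x896 input.
    Real pan & scan picks the grid adaptively from the aspect ratio; this
    fixed ceil-division grid is a simplified assumption."""
    nx = math.ceil(width / crop)         # crops across
    ny = math.ceil(height / crop)        # crops down
    w, h = width / nx, height / ny       # per-crop extent
    return [(round(i * w), round(j * h), round((i + 1) * w), round((j + 1) * h))
            for j in range(ny) for i in range(nx)]

boxes = pan_and_scan_crops(1792, 896)   # wide 2:1 image -> two side-by-side crops
```

Cropping before resizing avoids squashing wide or tall images into a square, which is the aspect-ratio artifact the technique addresses.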
- Baseline assurance evaluations — Metric: Model violation rate. Result summary: Low violation rate on safety policies.
- Chemical, Biological, Radiological and Nuclear (CBRN) knowledge evaluation — Metric: Knowledge evaluation. Result summary: Low knowledge in biological, radiological, and nuclear risks.
Areas of strength include math, coding, chat, instruction following, and multilingual capabilities. Gemma variants have been reported to outperform some larger models on Elo score, show better performance with higher-resolution encoders, and do particularly well on STEM-related tasks.
Safety, Limitations, and Open Questions
Reported limitations and caveats:
- Artifacts when processing non-square aspect ratios and high-resolution images are noted.
- Elo scores do not account for visual abilities, so comparisons based on Elo alone are incomplete for multimodal models.
- Risk of contamination of probes despite decontamination techniques is acknowledged.
- Baseline CBRN evaluation indicates low knowledge in chemical hazards and in biological, radiological, and nuclear risks.
Additional notable claims and metrics:
- "memorization rate significantly lower than prior models" is reported alongside a claim that "approximately memorized text increased by roughly 24x on average" (both statements appear as reported figures).
Ethical and safety-oriented processes described include decontamination techniques for training data and enhanced internal safety processes (including ShieldGemma 2) to mitigate evolving risks from more capable and multimodal models.
Summary
Gemma 3 is a multimodal, dense transformer family focused on long-context and visual capabilities, with instruction-tuned variants and models spanning 1B to 27B parameters. Architectural choices (interleaved local/global attention, grouped-query attention, a raised RoPE base frequency, and a frozen high-resolution vision encoder) support high-resolution visual tasks and extended context. Training combines large-scale pretraining (2T to 14T tokens depending on model size), knowledge distillation, ZeRO-3 optimizer sharding, and post-training alignment (SFT, RLHF). Evaluation shows competitive performance on a wide set of benchmarks, strong STEM and multilingual results, low safety-policy violation rates, and known limitations in non-square/high-resolution image handling and CBRN domain knowledge. Careful interpretation of metrics that omit visual abilities (such as Elo) is advised when comparing multimodal models.