Gemini 3 Pro — Overview of the model and fine-tuning approach
Summary and positioning
Gemini 3 Pro is a Google-developed large model positioned for complex tasks that require enhanced reasoning, long-context understanding, creativity, and agentic behavior. The source material reports two release dates, January 2025 and November 2025. The model emphasizes native multimodal support for text, vision, and audio, and introduces specialized capabilities intended to improve problem-solving and safety-related evaluations.
Primary strengths described include Deep Think mode for enhanced problem-solving, advanced coding and algorithmic development, long-context comprehension, and agentic performance that can coordinate multi-step or tool-using behaviors. The design explicitly targets problems that involve synthesizing information from multiple sources and adapting to real-world complexity, including improvements in step-by-step reasoning and in safety and refusal tone on automated evaluations.
- Complex tasks and problem solving across modalities
- Natively multimodal (text, vision, audio) inputs and long-context understanding
- Enhanced reasoning via Deep Think mode and specialized training signals
- Agentic capabilities and advanced coding performance
- Focus on safety and automated evaluation tone improvements
Architecture and design choices
The model architecture is Transformer-based and uses a sparse mixture-of-experts (MoE) design. Key architectural choices include:
- Dynamic routing of tokens to a subset of parameters (experts), enabling selective computation per token.
- Decoupling of total model capacity from per-token computation and serving cost, allowing large capacity while controlling runtime compute.
- The model is explicitly described as MoE rather than a dense transformer.
No detailed layer counts, hidden sizes, or attention/MLP head counts are provided. The emphasis in the architecture is on routing and capacity/compute trade-offs rather than on single-stack dense parameter counts.
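The routing idea above can be sketched in a few lines. The following is a generic top-k gating illustration in plain Python/NumPy, not Gemini's actual router: every token computes gate logits over all experts, but only the k highest-scoring experts actually run, so per-token compute stays fixed while total capacity grows with the number of experts.

```python
import numpy as np

def topk_moe_layer(tokens, gate_w, experts, k=2):
    """Route each token to its top-k experts and mix outputs by gate weight.

    tokens:  (n_tokens, d_model) activations
    gate_w:  (d_model, n_experts) router weights
    experts: list of per-expert weight matrices, each (d_model, d_model)
    """
    logits = tokens @ gate_w                    # (n_tokens, n_experts) gate scores
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts per token
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        chosen = logits[i, topk[i]]
        probs = np.exp(chosen - chosen.max())
        probs /= probs.sum()                    # softmax over the k selected gates only
        for p, e in zip(probs, topk[i]):
            out[i] += p * (tok @ experts[e])    # only k of n_experts run per token
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y = topk_moe_layer(rng.normal(size=(3, d)), rng.normal(size=(d, n_experts)), experts, k=2)
print(y.shape)  # (3, 8): each token touched only 2 of the 4 experts
```

This is the capacity/compute decoupling in miniature: adding experts grows total parameters, while k fixes the per-token cost.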
Training data, objectives, and compute
Pretraining used a large-scale, diverse mixture of data that includes publicly-available web documents and multimodal content: text, code, images, audio, and video. Training objectives and tooling include:
- Use of reinforcement learning techniques oriented toward multi-step reasoning.
- Inclusion of problem-solving and theorem-proving data to support algorithmic and reasoning capabilities.
- Training infrastructure and software included JAX and ML Pathways.
- Compute hardware: trained using Google’s Tensor Processing Units (TPUs).
Details such as total token count, language coverage counts, or other hyperparameters are not provided. Post-training fine-tuning details (supervised fine-tuning and preference alignment pipelines) are not specified in the available material.
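The RL pipeline itself is not described, but the general mechanism behind outcome-rewarded multi-step reasoning can be illustrated with a standard discounted-return computation (a textbook RL construction, not Gemini's actual training method): even when only the final answer earns a reward, every intermediate step receives a learning signal.

```python
def discounted_returns(step_rewards, gamma=0.99):
    """Compute the return G_t = r_t + gamma * G_{t+1} for each step.

    With outcome-only reward (all zeros except the final step), every
    intermediate reasoning step still receives a nonzero return.
    """
    returns = [0.0] * len(step_rewards)
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 4-step reasoning trajectory rewarded only for a correct final answer:
print(discounted_returns([0, 0, 0, 1.0], gamma=0.9))
```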
Evaluation highlights and benchmark performance
A headline claim is that the model “Significantly outperforms Gemini 2.5 Pro across a range of benchmarks.” The evaluation suite spans mathematics, multimodal reasoning, coding and agentic tasks, screen and document understanding, and automated safety metrics. Where specified, numeric results are preserved exactly as reported; bullets listing several values side by side reproduce the comparative scores from the source, which does not label the individual columns.
Mathematics and reasoning:
- Mathematics (with code execution): 100%
- MathArena Apex (challenging math contest problems): 23.4%, 0.5%
Multimodal understanding and document/screen/video comprehension:
- MMMU-Pro (multimodal understanding and reasoning): 81.0%, 68.0%, 76.0%
- ScreenSpot-Pro (screen understanding): 72.7%, 11.4%, 36.2%, 3.5%
- CharXiv Reasoning (information synthesis from complex charts): 81.4%, 69.6%, 68.5%, 69.5%
- OmniDocBench 1.5 (OCR; edit distance, lower is better): 0.115, 0.145, 0.145, 0.147
- Video-MMMU (knowledge acquisition from videos): 87.6%, 83.6%, 77.8%, 80.4%
Coding, agentic, and tool-use benchmarks:
- LiveCodeBench Pro (competitive coding problems from Codeforces, ICPC, and IOI; Elo-style ratings): 2,439, 1,775, 1,418, 2,243
- Terminal-Bench 2.0 (agentic terminal coding): 54.2%, 32.6%, 42.8%, 47.6%
- SWE-Bench Verified (agentic coding): 76.2%, 59.6%, 77.2%, 76.3%
- τ²-bench (agentic tool use): 85.4%, 54.9%, 84.7%, 80.2%
- Vending-Bench 2 (long-horizon agentic tasks): $5,478.16, $573.64, $3,838.74, $1,473.43
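Unlike the percentage-based benchmarks, LiveCodeBench Pro reports Elo-style ratings. As a quick illustration of what a rating gap means under the standard Elo model (standard Elo arithmetic, not anything specific to the benchmark's pairing methodology):

```python
def elo_expected(r_a, r_b):
    """Expected probability that a player rated r_a beats one rated r_b
    under the standard Elo logistic model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Using two of the reported LiveCodeBench Pro ratings:
p = elo_expected(2439, 1775)
print(round(p, 3))  # ≈ 0.979: a 664-point gap implies near-certain head-to-head wins
```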
Other task suites and specialized metrics:
- FACTS Benchmark Suite (held-out internal grounding, parametric, multimodal, and search retrieval benchmarks): 70.5%, 63.4%, 50.4%, 50.8%
- SimpleQA Verified (parametric knowledge): 72.1%, 54.5%, 29.3%, 34.9%
- MMMLU (multilingual Q&A): 91.8%, 89.5%, 89.1%, 91.0%
- Global PIQA (commonsense reasoning across 100 languages and cultures): 93.4%, 91.5%, 90.1%, 90.9%
- MRCR v2 (8-needle) (long-context performance): 77.0%, 26.3%, 58.0%, 16.4%, not supported, 61.6%, not supported
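Long-context benchmarks like MRCR v2 are variations on "needle-in-a-haystack" retrieval: key facts are buried in long filler text and the model is scored on recovering them. The sketch below is a toy version of that general idea in plain Python, not the actual MRCR protocol; `make_haystack` and `score_retrieval` are illustrative names.

```python
import random

def make_haystack(needles, filler_lines=1000, seed=0):
    """Bury key-value 'needles' at random positions inside filler text."""
    rng = random.Random(seed)
    lines = [f"filler sentence number {i}." for i in range(filler_lines)]
    for key, value in needles.items():
        lines.insert(rng.randrange(len(lines)), f"The secret for {key} is {value}.")
    return "\n".join(lines)

def score_retrieval(answers, needles):
    """Fraction of needles whose value appears in the model's answers."""
    return sum(needles[k] in answers.get(k, "") for k in needles) / len(needles)

needles = {"alpha": "7291", "beta": "3407"}
doc = make_haystack(needles)
# A real evaluation would prompt a model with `doc` plus one question per key;
# here we simulate one correct and one failed retrieval.
answers = {"alpha": "The secret is 7291", "beta": "I don't know"}
print(score_retrieval(answers, needles))  # 0.5
```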
Automated safety and refusal tone evaluations (comparisons to Gemini 2.5 Pro noted where provided):
- Text-to-Text Safety (automated content safety evaluation): -10.4% (comparator: Gemini 2.5 Pro)
- Multilingual Safety (Automated safety policy evaluation across multiple languages): +0.2% (non-egregious) (comparator: Gemini 2.5 Pro)
- Image-to-Text Safety (automated content safety evaluation): +3.1% (non-egregious) (comparator: Gemini 2.5 Pro)
- Tone (automated evaluation measuring objective tone of model refusals): +7.7% (comparator: Gemini 2.5 Pro)
- Unjustified refusals (automated evaluation measuring the model’s ability to respond to borderline prompts while remaining safe): +3.7% (non-egregious) (comparator: Gemini 2.5 Pro)
The model’s clearest reported wins are in enhanced reasoning and multimodal capabilities.
Limitations, caveats, and open questions
The model exhibits several limitations consistent with foundation models and additional operational caveats:
- Exhibits general limitations of foundation models, such as hallucinations.
- Occasional slowness or timeout issues.
- Jailbreak vulnerability (improved compared to Gemini 2.5 Pro but still an open research problem).
- Possible degradation in multi-turn conversations.
These limitations are presented as active areas for research and operational consideration.
Practical implications and design trade-offs
The combination of a sparse mixture-of-experts (MoE) architecture and multimodal pretraining aims to increase representational capacity without linearly increasing per-token serving compute. The use of reinforcement learning techniques targeted at multi-step reasoning, together with theorem-proving and problem-solving data, suggests a focus on algorithmic reasoning and advanced coding capabilities. Training on TPUs via JAX and ML Pathways reflects an investment in scalable infrastructure and modern ML toolchains.
In deployment planning, expect trade-offs between high-capacity inference behavior (via experts) and latency/timeout management, plus ongoing attention to safety alignment and jailbreak mitigation strategies.
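The slowness and timeout caveats noted earlier translate directly into client-side engineering. A common mitigation is retry with exponential backoff and jitter; the sketch below shows that generic pattern in Python. `call_with_retries` and the wrapped callable are illustrative, not part of any real Gemini SDK.

```python
import random
import time

def call_with_retries(request_fn, max_attempts=4, base_delay=1.0,
                      timeout_exc=(TimeoutError,)):
    """Retry a flaky zero-argument callable with exponential backoff and jitter.

    request_fn would typically wrap a model API call; on the final attempt
    the timeout exception is re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except timeout_exc:
            if attempt == max_attempts - 1:
                raise
            # Double the delay each attempt, with random jitter to avoid
            # synchronized retry storms across clients.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)

# Simulated flaky endpoint: times out twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated slow response")
    return "ok"

print(call_with_retries(flaky, base_delay=0.01))  # ok, after 2 retried timeouts
```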
Organizations and attribution
The project and model are attributed to Google and marketed under the name Gemini 3 Pro.
Sources
Gemini 3 Pro Model Card