Qwen3 — Model Family and Fine-tuning Approach
Overview
Qwen3 is a family of large language models, spanning multiple sizes, designed to advance performance, efficiency, and multilingual capability across reasoning, coding, mathematics, agent-related tasks, and long-context processing. The family emphasizes dynamic switching between thinking and non-thinking modes, reducing the cost and complexity of deploying separate models for different task types, and introduces mechanisms for adaptive computational allocation during inference and fine-tuning.
Key positioning claims include the integration of thinking and non-thinking modes into a unified framework, a novel thinking budget mechanism, expanded multilingual coverage to 119 languages and dialects, and state-of-the-art results across numerous public benchmarks.
Model variants and sizes
The Qwen3 family spans tiny to very large variants. Representative sizes and configurations documented include:
- Small and mid sizes: Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B, Qwen3-14B.
- Large and MoE configurations: Qwen3-30B-A3B (MoE; 30B total parameters, 3B activated per token), Qwen3-32B (dense), and Qwen3-235B-A22B (MoE; 235B total parameters, 22B activated per token).
- Context length capability: variants natively support up to 32,768 (32K) context tokens, with some configurations extended to 128K.
- Language coverage: multiple variants and the family as a whole list support for 119 languages and dialects.
Where stated, the largest production variant is Qwen3-235B-A22B, with 235B total parameters, 22B activated parameters per token, and a maximum context length of 32,768 tokens in at least one reported configuration.
Architecture and notable design choices
The Qwen3 family includes both dense and Mixture-of-Experts (MoE) architectures and reports a range of architectural hyperparameters and design elements:
- Layer counts mentioned: 28, 36, 40, 48, 64, 94.
- Attention head configurations: 16, 32, 40, 64; kv_heads reported as 8 and 4.
- Notable techniques and choices: Grouped Query Attention (GQA), SwiGLU activation, Rotary Positional Embeddings (RoPE) with an increased base frequency via ABF from 10,000 to 1,000,000, RMSNorm with pre-normalization, QK-Norm in attention, fine-grained expert segmentation for MoE, and global-batch load balancing loss.
- Sequence-length extensions: YaRN and Dual Chunk Attention (DCA) are introduced to increase sequence-length capacity.
- Mode control: dynamic mode switching governed by chat templates or user queries, with a chat-template design that embeds /think and /no_think flags and an "empty thinking block" mechanism for non-thinking responses.
- Training flow: a documented four-stage post-training process that first establishes 'thinking' capability and then integrates 'non-thinking' behavior in the later Stage 3 / Stage 4 refinements.
The architecture emphasizes cost-efficiency: Qwen3 MoE configurations are said to improve performance with fewer activated parameters, reducing inference and training costs relative to some alternatives.
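The Grouped Query Attention noted above shares each key/value head across several query heads, which shrinks the KV cache relative to full multi-head attention. A minimal pure-Python sketch of the head-to-group mapping and the KV expansion step, using illustrative head counts drawn from the reported configurations (e.g., 32 query heads with 8 KV heads); this is a conceptual illustration, not Qwen3's implementation:

```python
# Sketch of Grouped Query Attention (GQA) head sharing.
# Head counts are illustrative, taken from the reported configs
# (e.g., 32 query heads, 8 KV heads).

def kv_group_of(query_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Map a query-head index to the KV head it shares."""
    assert n_heads % n_kv_heads == 0
    group_size = n_heads // n_kv_heads
    return query_head // group_size

def repeat_kv(kv_heads: list, n_heads: int) -> list:
    """Expand KV heads so each query head sees its shared KV head,
    mirroring the repeat-interleave step applied before attention."""
    group_size = n_heads // len(kv_heads)
    return [kv for kv in kv_heads for _ in range(group_size)]

# With 32 query heads and 8 KV heads, each KV head serves 4 query heads.
print(kv_group_of(0, 32, 8))               # query head 0 -> KV head 0
print(kv_group_of(5, 32, 8))               # query head 5 -> KV head 1
print(len(repeat_kv(list(range(8)), 32)))  # 32 expanded KV entries
```

With the reported kv_heads values of 8 and 4, the cached KV state shrinks by 4x to 16x versus one KV head per query head, which is the main inference-cost benefit of GQA.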
Thinking vs non-thinking modes and prompt format
A core innovation is explicit control over internal "thinking" behavior. Two primary modes are exposed to the user via chat-template flags:
- /think enables the model to produce an internal thinking trace alongside the final response.
- /no_think requests no internal thinking trace; the assistant emits an empty thinking block and only the response.
Two example chat-template formats (thinking and non-thinking, respectively):

  <|im_start|>user
  {query} /think<|im_end|>
  <|im_start|>assistant
  {thinking content}
  {response}<|im_end|>

  <|im_start|>user
  {query} /no_think<|im_end|>
  <|im_start|>assistant
  {response}<|im_end|>
Operational controls include a thinking budget (reported as 8,192 tokens) that caps the length of the internal trace, mitigating verbosity and managing computational resource allocation on complex thinking tasks. The system supports intermediate cases in the thinking process and gives users control over whether internal chain-of-thought is generated or omitted.
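The mode flags and thinking budget described above can be sketched as a client-side prompt formatter plus a budget cap. The function names here are hypothetical, and the special-token strings are modeled on the template formats shown above rather than taken from an official API:

```python
# Sketch of mode-aware prompt formatting and a thinking-budget cap.
# Helper names and exact token strings are illustrative assumptions
# based on the chat-template formats above.

def format_prompt(query: str, thinking: bool) -> str:
    """Build a single-turn prompt carrying the /think or /no_think flag."""
    flag = "/think" if thinking else "/no_think"
    return (f"<|im_start|>user\n{query} {flag}<|im_end|>\n"
            f"<|im_start|>assistant\n")

def cap_thinking(thinking_tokens: list, budget: int = 8192) -> list:
    """Truncate the internal thinking trace to the token budget,
    mirroring the reported 8,192-token thinking budget."""
    return thinking_tokens[:budget]

prompt = format_prompt("What is 17 * 24?", thinking=False)
assert "/no_think" in prompt  # non-thinking requests carry the flag
```

A real deployment would enforce the budget inside the decoding loop (forcing the thinking block closed once the budget is spent); this sketch only shows the two control points the section describes.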
Tokenizer and prompt formatting
The tokenizer and input formatting details reported are:
- Tokenizer: byte-level byte-pair encoding (BBPE) and use of Hugging Face's tokenizer.
- Vocab size: "151,669".
- Chat template / prompt formats include explicit /think and /no_think flags, as illustrated above.
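Byte-level BPE operates on UTF-8 bytes rather than characters, so any string in any of the 119 supported languages is representable from a base alphabet of 256 symbols, with no unknown tokens. A minimal sketch of the byte stage and a single merge step (the actual merge table is learned; nothing here reflects Qwen3's real vocabulary):

```python
# Sketch of the byte-level step behind byte-level BPE (BBPE).
# Text maps to UTF-8 bytes first, so every string is tokenizable;
# the learned merge table then fuses frequent adjacent pairs.

def to_byte_tokens(text: str) -> list[int]:
    """Initial token sequence: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

def merge_pair(tokens: list[int], pair: tuple, new_id: int) -> list[int]:
    """One BPE merge: replace each adjacent occurrence of `pair`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(to_byte_tokens("Qwen"))  # [81, 119, 101, 110]
print(len(to_byte_tokens("千")))  # 3: one CJK character, three UTF-8 bytes
```

The byte fallback is what makes the large 151,669-entry vocabulary safe across scripts: rare characters simply decompose into their bytes instead of an unknown token.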
Training methodology and data mixture
Pretraining:
- Total pretraining tokens: 36 trillion overall; stage-level figures reported include roughly 30 trillion (general stage), about 5 trillion higher-quality tokens (reasoning stage), and hundreds of billions (long-context stage).
- Data mixture: multi-modal and synthetic components are listed, including use of Qwen2.5-VL for text extraction from PDFs, synthetic data from Qwen2.5-Math for mathematical content and Qwen2.5-Coder for code-related data, plus high-quality coding, STEM, reasoning tasks, books, and multilingual text corpora.
- Specific dataset construction notes: an SFT dataset combines thinking and non-thinking data; thinking data generation uses rejection sampling on Stage 1 queries; non-thinking data covers diverse tasks such as coding, mathematics, instruction-following, multilingual tasks, creative writing, question answering, and role-playing.
- Optimizer/schedule: an accelerated learning rate decay during the Reasoning Stage is reported.
- Compute: strong-to-weak distillation is claimed to require "only 1/10 of the GPU hours compared to the four-stage training method".
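The rejection-sampling step noted above for thinking-data generation amounts to sampling several candidate responses per query and keeping only those a verifier accepts. A toy sketch under that reading; the `sample` and `verify` callables are hypothetical stand-ins, not Qwen3's actual pipeline:

```python
# Sketch of rejection sampling for thinking-data construction:
# sample n candidates per query, keep only verifier-approved ones.
# `sample` and `verify` are hypothetical stand-ins.
import random

def rejection_sample(query, sample, verify, n=16):
    """Return the verified subset of n sampled candidates."""
    candidates = [sample(query) for _ in range(n)]
    return [c for c in candidates if verify(query, c)]

# Toy demonstration with a trivial exact-answer verifier.
random.seed(0)
sample = lambda q: random.choice(["407", "408", "409"])
verify = lambda q, c: c == "408"
kept = rejection_sample("17*24=?", sample, verify, n=8)
assert all(c == "408" for c in kept)
```

In practice the verifier would be an answer checker, unit tests, or a reward model, and the accepted traces become SFT training data.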
Post-training and alignment:
- Supervised fine-tuning (SFT): used, including continual SFT applied to the Reasoning RL model.
- Distillation and RL variants: the family documents multiple distillation strategies (off-policy, on-policy) and reinforcement learning variants with reported performance differences (see Evaluation). The Strong-to-Weak Distillation approach is highlighted as a method for optimizing lightweight models and enabling 1/10 activated parameter efficiency in some distilled models.
Distillation, fine-tuning, and inference budgeting
Several approaches are documented:
- Strong-to-Weak Distillation as an explicit technique for endowing lightweight models with strong reasoning capabilities while keeping most parameters inactive per token (claims include "1/10 activated parameters").
- Distillation variants and RL: off-policy, on-policy, and reinforcement learning are presented with benchmarked outcomes for AIME and MATH500 (see Evaluation subsection).
- Thinking budgets and mode-aware chat templates are part of the inference-time control and have been integrated into training recipes to align behavior across thinking and non-thinking modes.
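On-policy distillation, as commonly formulated, trains the student to match the teacher's next-token distribution on sequences the student itself samples (the student explores, the teacher supervises). A minimal sketch of the per-token KL term under that standard formulation; the distributions here are toy lists, not model outputs, and the source does not spell out this exact objective:

```python
# Sketch of the per-token objective in on-policy distillation:
# minimize KL(student || teacher) over next-token distributions
# on student-sampled sequences. Distributions are toy examples.
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(student_dists, teacher_dists):
    """Mean per-token KL from student to teacher distributions."""
    kls = [kl_divergence(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(kls) / len(kls)

student = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
teacher = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
loss = distill_loss(student, teacher)
assert loss >= 0.0  # KL is non-negative, zero when distributions match
```

Off-policy distillation differs only in where the sequences come from (teacher-generated rather than student-sampled), which is why the two variants can be benchmarked head-to-head as in the Evaluation subsection.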
Evaluation and benchmark highlights
Evaluation results are extensive. Headline performance for the largest reported variant, Qwen3-235B-A22B, includes:
- AIME'24: "85.7"
- AIME'25: "81.5"
- LiveCodeBench v5: "70.7"
- CodeForces rating: "2,056"
- BFCL v3: "70.8"
Aggregated benchmark comparisons (representative items pulled from reported results):
- MMLU score for Qwen3-235B-A22B: "87.81" (comparators listed include Qwen2.5-72B: "86.06", DeepSeek-V3: "87.19", etc.).
- GSM8K: Qwen3-235B-A22B "94.39" (comparators include Qwen2.5-72B "91.50").
- MATH: Qwen3-235B-A22B "71.84".
- EvalPlus: Qwen3-235B-A22B "77.60".
- Across many benchmarks, Qwen3 variants (both thinking and non-thinking configurations) are reported to outperform numerous baselines (e.g., Qwen2.5 variants, DeepSeek-V3, Gemma-3, LLaMA-4) in multiple configurations and on many tasks.
Comparative and mode-specific claims:
- Qwen3-235B-A22B (Thinking) reportedly outperforms DeepSeek-R1 on 17/23 benchmarks.
- Qwen3-235B-A22B (Non-thinking) reportedly surpasses GPT-4o-2024-11-20 in 18/23 benchmarks.
- Qwen3-32B (Thinking) reportedly outperforms QwQ-32B on 17/23 benchmarks, and Qwen3-32B (Non-thinking) performs on par with Qwen2.5-72B-Instruct in some reported comparisons.
Distillation and training regime benchmarking:
Pass@64 and MATH500 results are reported across Off-policy Distillation, Reinforcement Learning, and On-policy Distillation; examples:
- AIME'24 pass@64: Off-policy Distillation "90.0", Reinforcement Learning "90.0", On-policy Distillation "93.3".
- MATH500 score: Off-policy Distillation "92.4", Reinforcement Learning "94.8", On-policy Distillation "97".
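Pass@k figures of this kind are conventionally computed with the unbiased estimator from the HumanEval/Codex evaluation methodology: with n samples of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k). This is standard methodology rather than something the source states explicitly; a sketch:

```python
# Unbiased pass@k estimator conventionally used for results like
# the pass@64 numbers above: with n samples and c correct ones,
# pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement)
    from the n samples is correct."""
    if n - c < k:  # too few failures to fill k draws: guaranteed pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(100, 0, 64))   # 0.0: no correct samples at all
print(pass_at_k(100, 50, 1))   # 0.5: pass@1 equals the accuracy rate
```

The estimator is exact (not a Monte Carlo approximation) for a fixed sample pool, which is why pass@64 comparisons across training regimes are stable.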
Where it wins: the model family highlights strengths in state-of-the-art results across diverse benchmarks, coding and mathematics tasks, agent-related tasks, STEM reasoning, enhanced long-context processing, and multilingual capabilities.
Where it is weaker: reported weaknesses include "Degradation in performance for complex tasks after certain training stages" and "Performance slightly degrades in thinking mode" in some contexts.
Key contributions (selected)
- Integration of thinking mode and non-thinking mode into a unified framework.
- Introduction of a thinking budget mechanism for adaptive computational resource allocation.
- Expansion of multilingual support from 29 to 119 languages and dialects.
- Strong-to-Weak Distillation for optimizing lightweight models and enabling 1/10 activated parameter operation in some configurations.
- Pre-trained on "36 trillion" tokens (among other reported totals across sources).
Limitations, caveats, and open questions
Reported limitations and caveats include:
- "Performance trade-off for enhanced overall versatility" — a general note that maximizing versatility can induce trade-offs.
- Documented cases of degraded performance on complex tasks after specific training stages.
- Slight performance degradation in thinking mode for some tasks, indicating mode-specific tuning challenges.
- Many reported numbers and comparisons depend on variant, training stage, and whether thinking or non-thinking mode was used; results vary across datasets and comparators.
Notable numbers and operational facts
- Vocab size: "151,669".
- Thinking budget: "8192 tokens".
- Pretraining tokens reported: "36 trillion", "30 trillion", "5 trillion", "hundreds of billions".
- Over "20 distinct tasks covered in the reward system".
- Compute claim: "Requires only 1/10 of the GPU hours compared to the four-stage training method".
Summary of strengths
Qwen3 emphasizes a mode-aware design that provides user-level control between internal chain-of-thought and compact responses, broad multilingual coverage ("119 languages and dialects"), MoE options for inference cost reduction, multiple distillation strategies for creating lightweight high-performing models, and strong benchmark performance across mathematics, coding, reasoning, and multilingual evaluations.