DeepSeek-V3.2 (DeepSeek-AI)
Overview
DeepSeek-V3.2 is an agentic, long-context model family developed by DeepSeek-AI with a focus on balancing computational efficiency and advanced reasoning for agentic and code-centric workflows. The project emphasizes sparse attention and modular routing to reduce inference cost on long sequences while supporting extended chain-of-thought reasoning, tool use, and automated environment setup for software issue resolution. Key methodological highlights include the introduction of DeepSeek Sparse Attention (DSA) under an MLA (Multi-head Latent Attention) framework, mixed reinforcement-learning training using Group Relative Policy Optimization (GRPO), and routing-preserving strategies in the inference and training pipelines.
- Context window extended to 128K tokens, with context-management strategies to stretch token budgets at test time.
- Agentic task synthesis: generated large-scale executable environments and diverse problem sets for code and agent evaluation.
- Efficiency + reasoning: activates only a subset of expert modules during inference, improving the compute trade-off relative to dense attention.
- Training emphasis: increased RL training budget and routing-preserving mechanisms to stabilize MoE-style behavior.
Variants and Context Capacity
The family includes multiple labeled variants with overlapping names and experimental labels. Consolidated variant information:
- DeepSeek-V3.2 — primary release identifier. Languages explicitly listed: Python, Java, JavaScript, TypeScript, C, C++, Go, PHP.
- DeepSeek-V3.2-Speciale — positioned as a higher-performing variant in evaluations; shares the V3.2 name in some listings.
- DeepSeek-V3.2-Exp — experimental variant; explicitly reported with a max context token capacity of 128K.
Variant suffixes distinguish experimental from production or instruction-tuned versions; one variant, DeepSeek-V3.2-SFT, denotes a supervised fine-tuned post-training artifact. Multiple entries cite 128,000 (128K) tokens as the extended context budget.
Architecture and Design
The design centers on a hybrid approach combining sparse attention with modular routing:
- Core architecture: DeepSeek Sparse Attention (DSA) implemented under an MLA-style framework.
- The architecture supports both dense and MoE configurations, depending on use case.
Notable design choices include:
- A lightning indexer and fine-grained token selection mechanism.
- Key-value entries shared across multiple queries.
- Use of different models or modes for thinking and non-thinking operation.
- Preservation and enforcement of identical expert routing paths during training and inference; the Keep Routing operation was adopted in the RL training pipeline.
- Context management strategies and a length-constraint reward model to control generation trajectories.
- Temperature set to 1.0 in tool-use evaluations and context window configured to 128K tokens where applicable.
- Automated environment-setup agent (package installation and dependency resolution) and Jupyter Notebook integration for code interpretation.
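The sparse-attention design above (a lightweight indexer scoring past tokens, fine-grained top-k token selection, and KV entries shared across queries) can be sketched as follows. This is a minimal illustration under stated assumptions: the source does not specify the indexer's scoring function or the selection size, so `index_scores` and `top_k` here are hypothetical stand-ins.

```python
import numpy as np

def sparse_attention_topk(q, k, v, index_scores, top_k):
    """Sketch of DSA-style sparse attention: a lightweight indexer scores
    cached tokens, and attention runs only over the top-k keys/values,
    which are shared across all queries."""
    # Select the top-k token positions according to the indexer's scores.
    keep = np.argsort(index_scores)[-top_k:]
    k_sel, v_sel = k[keep], v[keep]  # selected KV entries, shared by every query
    # Standard scaled dot-product attention over the selected subset only.
    logits = q @ k_sel.T / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_sel

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))    # 4 queries, head dim 8
k = rng.standard_normal((128, 8))  # 128 cached tokens
v = rng.standard_normal((128, 8))
scores = rng.standard_normal(128)  # hypothetical lightning-indexer output
out = sparse_attention_topk(q, k, v, scores, top_k=16)
print(out.shape)  # (4, 8)
```

The compute saving comes from the attention matrix shrinking from 4×128 to 4×16; only the indexer touches every cached token.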
Tokenization and Prompting
Prompting practices emphasize stepwise reasoning and explicit final-answer formatting:
- The primary chat template / prompt format used: "{question}\nPlease reason step by step, and put your final answer within \boxed{}."
- No explicit tokenizer type or vocabulary size was provided.
- System prompts and other prompt-engineering details beyond the stepwise template are not enumerated.
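A minimal helper applying the reported chat template might look like the following; the function name is illustrative, not part of any documented API.

```python
def build_prompt(question: str) -> str:
    """Apply the reported prompt template: the question, then the
    step-by-step instruction asking for the final answer in \\boxed{}."""
    return (f"{question}\n"
            "Please reason step by step, and put your final answer "
            "within \\boxed{}.")

prompt = build_prompt("What is 7 * 8?")
print(prompt)
```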
Training and Optimization
Training is staged, mixing pretraining with a substantial post-training RL investment to close efficiency and capability gaps.
Pretraining highlights:
- Reported total pretraining token counts are inconsistent across entries: both "2.1B" and "943.7B" are cited.
- Data mixture aligned with the 128K long context extension data.
- Objectives include KL-divergence loss for the indexer and maximizing policy model objective using GRPO.
Optimizer and schedule notes:
- Learning rate of 10^-3 for warm-up.
- Learning rate of 7.3 × 10^-6 for sparse training.
- Use of Group Relative Policy Optimization (GRPO).
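GRPO's core idea, group-relative advantages fed into a PPO-style clipped surrogate, can be sketched as below. This is a minimal illustration, not DeepSeek's implementation: the KL penalty (β) and divergence threshold (δ) are omitted, and the clipping value shown is a placeholder.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each of the G sampled responses is
    scored against its group's mean reward, normalized by the group
    standard deviation (no learned value function needed)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_surrogate(ratio, adv, eps=0.2):
    """PPO-style clipped objective; eps plays the role of the clipping
    range ε (DeepSeek's actual value is not given in the source)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # rewards for a group of 4 samples
print(advs)  # roughly [ 1. -1. -1.  1.]
```

Because advantages are normalized within each sampled group, responses are rewarded only for beating their siblings, which is what makes rule-based outcome rewards usable without a critic.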
Important hyperparameters and schedules:
- Post-training computational budget exceeds 10% of pre-training cost.
- 1000 steps for warm-up.
- 15000 steps for sparse training.
- Additional RL-specific knobs: clipping range (ε), KL penalty strength (β), and policy divergence threshold (δ).
- Compute summary: RL training budget exceeds 10% of the pre-training cost; there are plans to scale up pre-training compute to address knowledge gaps.
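Taken together, the reported rates and step counts suggest a simple two-phase learning-rate schedule; how the phases join is an assumption, since the source lists only the per-phase values.

```python
def learning_rate(step: int) -> float:
    """Sketch of the reported two-phase schedule: 1,000 warm-up steps at
    1e-3, then 15,000 sparse-training steps at 7.3e-6. Behavior beyond
    the listed steps is an assumption."""
    if step < 1_000:            # warm-up phase
        return 1e-3
    if step < 1_000 + 15_000:   # sparse-training phase
        return 7.3e-6
    return 0.0                  # beyond the reported schedule

print(learning_rate(500), learning_rate(5_000))  # 0.001 7.3e-06
```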
Post-training and alignment:
- A supervised fine-tuned artifact is referenced as DeepSeek-V3.2-SFT.
- Mixed RL training with GRPO and incorporation of rule-based outcome reward, length penalty, and language consistency reward are used during RL phases.
- Keep Routing and sampling strategies (combining top-p sampling with a Keep Sampling Mask) are used to preserve language consistency during RL training.
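Combining top-p sampling with a keep-mask can be sketched as follows. The source does not describe how the Keep Sampling Mask is constructed, so `keep_mask` here is a hypothetical boolean vocabulary mask (e.g. marking tokens of the target language as allowed).

```python
import numpy as np

def masked_top_p_sample(probs, keep_mask, top_p=0.95, rng=None):
    """Sketch of top-p (nucleus) sampling combined with a keep-mask:
    tokens outside the mask are zeroed out before nucleus truncation,
    so they can never be sampled."""
    rng = rng if rng is not None else np.random.default_rng()
    p = np.where(keep_mask, probs, 0.0)
    p = p / p.sum()                    # renormalize over allowed tokens
    order = np.argsort(p)[::-1]        # sort token ids by probability, descending
    cum = np.cumsum(p[order])
    cutoff = np.searchsorted(cum, top_p) + 1  # smallest nucleus covering top_p mass
    nucleus = order[:cutoff]
    q = p[nucleus] / p[nucleus].sum()  # renormalize within the nucleus
    return int(rng.choice(nucleus, p=q))

probs = np.array([0.5, 0.3, 0.15, 0.05])
mask = np.array([True, True, False, True])  # token 2 disallowed by the mask
tok = masked_top_p_sample(probs, mask, top_p=0.9, rng=np.random.default_rng(0))
print(tok)
```

With these inputs the nucleus is {0, 1}: token 2 is masked out, and token 3's renormalized mass falls outside the 0.9 cutoff, so sampling can only return 0 or 1.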
Evaluation and Benchmarks
Headline evaluation claims:
- DeepSeek-V3.2 performs comparably to GPT-5 on reported reasoning benchmarks.
- DeepSeek-V3.2-Speciale is reported to surpass GPT-5 and achieve performance parity with Gemini-3.0-Pro.
- DeepSeek-V3.2-Exp shows similar performance to DeepSeek-V3.1-Terminus.
A wide set of benchmarks and metrics were reported across reasoning, code, tool use, and agentic evaluations. Selected benchmark highlights and comparative numbers as reported:
- MMLU-Pro (EM): Sonnet High Pro: 88.2; Claude-4.5: 90.1; GPT-5: 84.6; Gemini-3.0: 85.0; Kimi-K2: 77.7; MiniMax: 82.4. (DeepSeek-V3.2 Thinking: Not provided.)
- HLE (Pass@1): Sonnet High Pro: 13.7 / 32.0 (two different reporting blocks); Claude-4.5: 37.7 / 35.2; GPT-5: 23.9 / 45.8; Gemini-3.0: 25.1 / 44.9; Kimi-K2: 12.5 / 31.8. DeepSeek-V3.2 Thinking reported as 23.9 in one entry and Not provided in others.
- LiveCodeBench (Pass@1-COT): Sonnet High Pro: 64.0 / 84.5; Claude-4.5: 90.7; GPT-5: 82.6; Gemini-3.0: 83.0; Kimi-K2: 83.3. DeepSeek-V3.2 Thinking: Not provided for many entries.
- Terminal Bench 2.0: Achieved score of 46.4 using Claude Code framework; achieved score of 39.3 in non-thinking mode.
- Code and contest-level claims: gold-medal level performance reported on International Mathematical Olympiad (IMO) 2025, International Olympiad in Informatics (IOI) 2025, ICPC World Final 2025, and CMO 2025 in several entries; additional mentions include ranked 2nd in ICPC WF 2025 and ranked 10th in IOI 2025.
- AA-LCR benchmark: DeepSeek-V3.2-Exp scores four points higher than DeepSeek-V3.1-Terminus.
- Fiction.liveBench and other agent/long-tail benchmarks: DeepSeek-V3.2-Exp consistently outperforms DeepSeek-V3.1-Terminus.
- SWE-bench Verified: reported as "Significantly outperforms open-source LLMs."
Where it wins: the model family is reported to excel in long-tail agent tasks, long-context tasks, and cost-efficient agent scenarios; RL training increases parity with proprietary models and significantly improves open-source standing on tool-use and agent benchmarks.
Where it is weaker: token efficiency is reported as inferior to Gemini-3.0-Pro; knowledge breadth and token efficiency remain challenges relative to leading proprietary models, and the model requires longer generation trajectories, which further hurts token efficiency.
Strengths and Weaknesses
Strengths:
- Long-context capacity (128K) combined with sparse attention enables targeted computation on long inputs.
- Agentic and tool-use proficiency via a large-scale agentic task synthesis pipeline and executable environments.
- Improved RL stability and routing consistency thanks to the Keep Routing operation and routing enforcement in training.
Weaknesses:
- Token efficiency remains a reported challenge, requiring longer generation trajectories and yielding lower token efficiency than some competitors.
- Knowledge breadth is noted as behind leading proprietary models, attributed in part to fewer total pretraining FLOPs.
- Performance constrained by a length-constraint reward model and occasional exceedance of context length limits.
Limitations and Open Questions
Stated limitations and caveats include:
- Performance constrained by the length constraint reward model.
- Generation trajectories can exceed the 128K context limit in practice.
- Knowledge gap resulting from fewer total training FLOPs compared to some proprietary models.
- Token efficiency remains a challenge and can lead to inferior behavior on certain complex tasks.
- Inferior performance on some frontier tasks relative to leading proprietary models.
Open questions for further development are implied by plans to scale pretraining compute and to expand RL training budgets to close remaining gaps.
Notable Numbers, Artifacts, and Practices
Reported artifacts and quantitative details:
- Generated over 1,800 distinct environments and 85,000 complex prompts.
- Number of tasks for agents and interpreters: code agent: 24,667; search agent: 50,275; general agent: 4,417; code interpreter: 5,908.
- Listed counts also include 1,827 task-oriented environments synthesized and 4,417 total tasks generated.
- Keep Routing operation was adopted in the RL training pipeline since DeepSeek-V3-0324.
- Sampling strategies: combining top-p sampling with the Keep Sampling Mask to preserve language consistency during RL training.