
DeepSeek-V3 — Model, Architecture, Training, and Evaluation

Overview and variants

DeepSeek-V3 is an open-source, large Mixture-of-Experts (MoE) language model family developed by DeepSeek-AI. Key released variants are DeepSeek-V3 (the chat model) and DeepSeek-V3-Base; reported comparisons set the base model against other public models such as DeepSeek-V2 Base, Qwen2.5 72B Base, and LLaMA-3.1 405B Base, with models like DeepSeek-Coder-V2 appearing in coding-focused comparisons. The flagship configuration has 671B total parameters with 37B activated for each token, and supports contexts up to 128K tokens. The primary pretraining run covers a reported 14.8 trillion (14.8T) tokens; smaller totals such as 1.33T and 540B tokens appear in auxiliary ablation experiments.

Key contributions and innovations

  • 671B total parameters / 37B activated per token: a large MoE instantiation that activates a subset of parameters per token to scale capacity while reducing per-step compute.
  • Multi-head Latent Attention (MLA) and DeepSeekMoE: architectural adaptations intended to enable efficient inference and cost-effective training at large scale.
  • Auxiliary-loss-free load balancing: a strategy that dynamically adjusts per-expert bias terms, with a complementary sequence-wise auxiliary loss to avoid extreme imbalance within single sequences.
  • Multi-Token Prediction (MTP) training objective: predicts multiple future tokens (reported as predicting next 2 tokens) to improve data efficiency and prediction accuracy.
  • Engineering and system-level advances including the DualPipe pipeline algorithm, efficient cross-node all-to-all kernels, and fine-grained quantization and FP8 mixed-precision support for economical training.
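To make the MTP objective concrete, here is a minimal NumPy sketch (not the paper's implementation; the shapes, the λ value, and the single extra prediction depth are illustrative) of combining the standard next-token loss with a loss on the token two positions ahead:

```python
import numpy as np

def cross_entropy(logits, target):
    # Softmax cross-entropy for a single position.
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def mtp_loss(main_logits, mtp_logits, tokens, lam=0.3):
    """Combined training loss: the standard next-token cross-entropy
    plus a lambda-weighted loss for the extra MTP depth (here a single
    depth, i.e. the model also predicts the token two positions ahead).
    lam=0.3 is an illustrative placeholder weight."""
    T = len(tokens)
    # Main objective: position t predicts token t+1.
    l_main = np.mean([cross_entropy(main_logits[t], tokens[t + 1])
                      for t in range(T - 1)])
    # MTP depth 1: position t predicts token t+2.
    l_mtp = np.mean([cross_entropy(mtp_logits[t], tokens[t + 2])
                     for t in range(T - 2)])
    return l_main + lam * l_mtp
```

At inference time the MTP modules can simply be discarded, so the main model runs as an ordinary next-token predictor.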

Architecture and notable design choices

The model family is based on Transformer principles with explicit Mixture-of-Experts (MoE) layers and several specialized designs:

  • Core architectural statistics reported include 61 layers, hidden size 7168, and 128 attention heads.
  • MoE design elements: finer-grained experts and isolation of shared experts (DeepSeekMoE), with 1 shared expert and 256 routed experts per MoE layer and 8 routed experts activated per token. Routing is constrained so that each token is dispatched to at most 4 nodes; under this constraint, the report notes that up to 13 routed experts per token could be selected while preserving the same communication cost.
  • Attention and compression: Multi-head Latent Attention for efficient inference, and low-rank joint compression for attention keys and values with RMSNorm after compressed latent vectors.
  • Load balancing: an auxiliary-loss-free strategy that dynamically adjusts a bias term (update speed 𝛾) and employs a complementary sequence-wise auxiliary loss; reported balance factor (𝛼) and weighting factor for MTP loss (𝜆) are among the tuning knobs.
  • Parallelism and scheduling: training reportedly uses 16-way Pipeline Parallelism (PP) and 64-way Expert Parallelism (EP, spanning 8 nodes) with ZeRO-1 Data Parallelism, while inference deployments use modes such as 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), 8-way Data Parallelism (DP8), and 32-way Expert Parallelism (EP32) in the prefilling stage. The DualPipe bidirectional pipeline schedule for micro-batches, warp specialization, and per-group scaling factors inside GEMM operations improve computation-to-communication overlap.
  • Precision and quantization: FP8 mixed-precision training with a fine-grained FP8 quantization method. The E4M3 format is adopted for FP8 tensors, a customized E5M6 format is used for certain cached activations, and optimizer moments are stored in BF16. Online quantization enables accurate scaling, scaling factors are restricted to integral powers of 2, and partial results are promoted to CUDA cores for higher-precision accumulation on key paths. Memory-saving strategies are also described, such as caching the inputs of the SwiGLU operator and recomputing its output in the backward pass.
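The auxiliary-loss-free balancing idea can be sketched as follows. This is a simplified NumPy illustration (the affinity computation and the exact γ value are assumptions): the bias affects only expert *selection*, while the gating weights come from the original affinities, and after each step overloaded experts have their bias nudged down and underloaded experts up.

```python
import numpy as np

def route_topk(affinity, bias, k=8):
    """Select top-k experts per token using bias-adjusted scores.
    Gating weights are computed from the original affinities only."""
    sel = np.argsort(affinity + bias, axis=-1)[:, -k:]   # chosen expert ids
    gates = np.take_along_axis(affinity, sel, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)    # normalize per token
    return sel, gates

def update_bias(bias, selected, n_experts, gamma=0.001):
    """End-of-step bias update: decrease bias for overloaded experts,
    increase it for underloaded ones (update speed gamma)."""
    load = np.bincount(selected.ravel(), minlength=n_experts)
    return bias - gamma * np.sign(load - load.mean())
```

Because no balancing gradient is injected into the main loss, the strategy avoids the quality penalty that a large auxiliary balancing loss can impose.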

Tokenizer, prompt format, and system prompt behavior

Tokenization uses a byte-level BPE tokenizer with a reported vocabulary of 128K tokens. The family adopts the "Zero-Eval prompt format" as the chat/prompt convention, and system prompts are designed to guide responses with reflection and verification mechanisms.

Pretraining data, objectives, and optimization

Pretraining and objective choices:

  • Training objective: Multi-Token Prediction (MTP) with cross-entropy loss applied to the multi-token targets (reported as predicting the next 2 tokens), intended to increase data efficiency.
  • Data: the training mix increases the ratio of mathematical and programming samples and expands multilingual coverage beyond English and Chinese.
  • Token budgets: several token totals are cited, notably 14.8 trillion (14.8T) tokens across major runs; other values reported include 1.33T and 540B depending on variant experiments.
  • Optimizer and schedule: AdamW, with first and second moments stored in BF16. The reported schedule increases the learning rate linearly to 2.2×10⁻⁴ over the first 2K steps, holds it at 2.2×10⁻⁴ until 10T tokens, then decays it to 2.2×10⁻⁵ via cosine decay over the next 4.3T tokens. A separate cosine decay from 5×10⁻⁶ to 1×10⁻⁶ is cited for the supervised fine-tuning phase.
  • Important hyperparameters: β1 = 0.9, β2 = 0.95, weight_decay = 0.1, gradient clipping norm 1.0, a batch size that ramps from 3072 to 15360 early in training, the fill-in-the-middle (FIM) strategy applied at a rate of 0.1, and two epochs for the supervised fine-tuning stage.
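The reported learning-rate schedule can be written down directly. This sketch follows the warmup / constant / cosine-decay phases described above, with an illustrative token-based warmup standing in for the paper's 2K-step warmup:

```python
import math

def lr_schedule(tokens, warmup_tokens=8e9,
                peak=2.2e-4, final=2.2e-5,
                constant_until=10e12, decay_tokens=4.3e12):
    """Piecewise schedule: linear warmup to the peak rate, constant
    until 10T tokens, then cosine decay to the final rate over the
    next 4.3T tokens. warmup_tokens is an illustrative stand-in for
    the reported 2K-step warmup."""
    if tokens < warmup_tokens:
        return peak * tokens / warmup_tokens
    if tokens < constant_until:
        return peak
    frac = min((tokens - constant_until) / decay_tokens, 1.0)
    return final + 0.5 * (peak - final) * (1 + math.cos(math.pi * frac))
```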

Post-training alignment and fine-tuning

Post-training steps include supervised fine-tuning (SFT) and preference alignment:

  • SFT: applied to DeepSeek-V3-Base with a reported dataset of 1.5M instances spanning multiple domains, including reasoning capability distilled from the DeepSeek-R1 series of models.
  • Preference alignment: reinforcement learning with rule-based and model-based reward models; a Constitutional AI-style approach, using DeepSeek-V3's own voting-based evaluations as a feedback source, is also reported.
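The underlying report uses group-based reinforcement learning (Group Relative Policy Optimization, GRPO) for this stage. A minimal, hypothetical sketch of the group-relative advantage computation at its core (the normalization details here are assumptions): several responses are sampled per prompt, each is scored by a reward model, and each response's advantage is its reward standardized against the group.

```python
import statistics

def group_relative_advantages(rewards):
    """For one prompt, score a sampled group of responses; the
    advantage of each response is its reward standardized against
    the group mean and standard deviation (no learned value model)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard a zero-variance group
    return [(r - mean) / std for r in rewards]
```

The appeal of the group-relative formulation is that it avoids training a separate critic/value network.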

Parallelism, compute footprint, and deployment constraints

Engineering and compute summary:

  • Training compute: a reported total of 2.788M (2788K) H800 GPU hours, broken down as 2664K GPU hours for pre-training, 119K for context-length extension, and 5K for post-training.
  • Monetary cost: total training cost of $5.576M, at an assumed rental rate of $2 per H800 GPU hour. The report also claims that training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours.
  • Hardware and interconnect: training reportedly used a cluster with 2048 NVIDIA H800 GPUs; each node contains 8 GPUs connected by NVLink and NVSwitch; InfiniBand interconnects used across nodes. NVLink bandwidth is cited at 160 GB/s, compared with IB at 50 GB/s.
  • Deployment minima and throughput notes: minimum deployment unit for prefilling stage cited as 4 nodes with 32 GPUs and for decoding stage 40 nodes with 320 GPUs; batch size per expert during decoding usually within 256 tokens. Some low-level engineering tradeoffs are reported such as SM allocation for communication: 20 out of 132 SMs on H800 GPUs and strategies to overlap kernel dispatch with computation.
  • Communication and routing constraints: claims of a computation-to-communication ratio of approximately 1:1, and that a maximum of 13 experts can be routed while preserving communication cost.
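The compute figures above are internally consistent; a quick arithmetic check, assuming the $2-per-GPU-hour rental rate used in the report's cost accounting:

```python
# Reported GPU-hour breakdown (thousands of H800 GPU hours).
pretrain_kh, context_ext_kh, posttrain_kh = 2664, 119, 5
total_kh = pretrain_kh + context_ext_kh + posttrain_kh   # 2788K hours

price_per_gpu_hour = 2.0                                 # USD (assumed rate)
total_cost = total_kh * 1000 * price_per_gpu_hour        # $5,576,000

# Per-trillion-token claim: 180K GPU hours per 1T tokens,
# i.e. roughly 3.7 days of wall-clock time on the 2048-GPU cluster.
days_per_trillion = 180_000 / 2048 / 24
```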

Evaluation highlights and benchmark performance

Headline claims: DeepSeek-V3 is presented as the best-performing open-source base model on many benchmarks, with performance comparable to leading closed-source models in multiple evaluations. Reported highlights include:

  • MMLU: Score 88.5 (comparisons listed to GPT-4o and Claude-3.5).
  • MMLU-Pro: Score 75.9.
  • GPQA: Score 59.1.
  • AlpacaEval 2.0: length-controlled win rate of 70.0 against a comparator set including DeepSeek-V2.5-0905, Qwen2.5-72B-Instruct, LLaMA-3.1 405B, GPT-4o-0513, and Claude-Sonnet-3.5-1022.
  • Arena-Hard: a win rate of 85.5% against GPT-4-0314 in reported head-to-heads.
  • HumanEval (generation/code): reported Pass@1: 65.2 for the leading DeepSeek-V3 Base entry among the comparators shown.
  • MATH-500 and other math benchmarks: state-of-the-art or top-tier reported performance, including surpassing the next-best model, Qwen2.5 72B, by approximately 10% in absolute score on AIME 2024, MATH-500, and CNMO 2024.
  • DROP: 3-shot F1 of 91.6 in one reported set.
  • Long-context behavior: performs well across context window lengths up to 128K (Needle In A Haystack benchmark).
  • Comparative matrices report per-benchmark EM, F1, Pass@1, BPB, and other metrics across four base models (DeepSeek-V2 Base, Qwen2.5 72B Base, LLaMA-3.1 405B Base, and DeepSeek-V3 Base), with DeepSeek-V3 Base the top entry in most rows.

Collectively, evaluation summaries emphasize strengths on English, multilingual, code, and math tasks, long-context understanding, and many educational benchmarks.

Strengths and weaknesses

Strengths:

  • Strongest open-source base model reported for code and math tasks, with wide multilingual coverage and long-context capabilities.
  • System and algorithm co-design enabling reduced pipeline bubbles, efficient communication overlap, and economical training costs (FP8 support and quantization).
  • Demonstrated strong performance on many standard academic and competition benchmarks with several top-tier scores.

Limitations and weaker points:

  • Slightly below Claude-Sonnet-3.5 on certain engineering-related tasks and on some benchmarks such as CMMLU and SimpleQA.
  • Deployment challenges for smaller teams due to minimum deployment unit sizes and large recommended deployment footprints.
  • End-to-end generation speed and inference throughput are noted as areas with room for improvement.
  • Precision caveats in FP8 GEMM on NVIDIA H800 GPUs: limited accumulation precision retaining around 14 bits in some operations.

Limitations, caveats, and open questions

Reported limitations and caveats include:

  • No irrecoverable loss spikes or rollbacks are reported during the training process, though specific numerical limitations are noted, such as the limited accumulation precision of FP8 GEMM on NVIDIA H800 GPUs (around 14 bits retained).
  • Deployment challenges for small teams due to minimum node/GPU counts and specialized interconnect needs.
  • Ongoing needs to improve end-to-end generation speed and inference efficiency for some deployment configurations.

Notable operational numbers and quoted metrics

Key quoted figures and engineering tradeoffs that characterize scale and behavior:

  • Total training costs: $5.576M.
  • Training costs in H800 GPU hours: 2788K (also cited as 2.788M H800 GPU hours).
  • Per-trillion-token cost claim: 180K H800 GPU hours per 1T tokens.
  • NVLink bandwidth: 160 GB/s (noted as roughly 3.2× that of IB at 50 GB/s).
  • Relative loss error of FP8-trained models remains consistently below 0.25% in reported characterizations.
  • With K = 4096 in certain GEMM operations, the limited accumulation precision yields a maximum relative error near 2%.
  • Accumulation promotion interval: N_C = 128 elements (partial results are promoted for higher-precision accumulation every 128 elements).
  • Minimum deployment unit of prefilling stage: 4 nodes with 32 GPUs; decoding stage: 40 nodes with 320 GPUs.
  • Batch size per expert during decoding: usually within 256 tokens.
  • SM allocation for communication: 20 out of 132 SMs available in H800 GPU in reported setups.
  • Experimental evaluation settings: temperature 0.7 for AIME and CNMO 2024 evaluations; maximum output 8192 tokens for some benchmarks; results for some math assessments averaged over 16 runs.
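Several of the quoted numbers (fine-grained quantization over 128-element groups, integral power-of-2 scaling factors, and the E4M3 format, whose maximum representable magnitude is 448) fit together in a simple sketch. This NumPy illustration omits the actual rounding to discrete FP8 values:

```python
import numpy as np

E4M3_MAX = 448.0   # largest representable magnitude in the E4M3 format

def quantize_1x128(x, group=128):
    """Fine-grained quantization sketch: scale values per 128-element
    group so the group maximum fits within the E4M3 range, rounding
    each scale to an integral power of 2."""
    x = x.reshape(-1, group)
    amax = np.abs(x).max(axis=-1, keepdims=True)
    # Smallest power of 2 such that amax / scale <= E4M3_MAX.
    scales = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-12) / E4M3_MAX))
    q = np.clip(x / scales, -E4M3_MAX, E4M3_MAX)
    return q, scales
```

Power-of-2 scales make dequantization (multiplying back by the scale) exact in binary floating point, which is part of why they are attractive for accurate scaling.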

Sources

https://arxiv.org/abs/2412.19437