Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
Overview
Distribution-Aligned Sequence Distillation (DASD) is a data-efficient sequence-level distillation pipeline developed by Alibaba Cloud Computing to enhance reasoning capabilities in compact models. The approach targets multi-domain reasoning tasks — including mathematics, code generation, scientific reasoning, and complex instruction-following — by addressing core distillation challenges: inadequate coverage of the teacher's sequence-level distribution, misalignment between teacher outputs and student capacity, and exposure bias from teacher-forced training versus autoregressive inference.
Primary model variants associated with the approach include DASD-4B-Thinking, Qwen3-4B-Instruct-2507, Qwen3-Next-80B-A3B-Thinking, DASD-30B-A3B-Thinking-Preview, and gpt-oss-120b. The underlying research is reported under the title "Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning."
Key components and contributions
The training pipeline emphasizes sequence-level alignment and selective sampling to transfer multi-step reasoning behavior from large teacher models into smaller student models while remaining data-efficient. Key contributions include:
- Divergence-aware sampling (DAS) to prioritize examples rich in Teacher Sentences for effective knowledge transfer.
- Temperature-scheduled learning to broaden coverage of the teacher's output distribution and stabilize early-stage learning.
- Mixed-policy distillation, implemented as a lightweight stage to blend sampling policies and reduce exposure bias.
- Rigorous response filtering and quality control, enabling strong performance with limited data.
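The report describes divergence-aware sampling only at a high level and does not publish the scoring rule. A minimal sketch, assuming examples are ranked by the mean per-token teacher-student log-probability gap (a single-sample estimate of forward KL) and the highest-divergence examples are retained; all function names here are illustrative, not from the paper:

```python
def sequence_divergence(teacher_logprobs, student_logprobs):
    """Mean per-token divergence between teacher and student.

    Approximates forward KL from the log-probs of the sampled tokens
    only (a single-sample Monte Carlo estimate); a hypothetical choice,
    since the report does not specify the divergence measure.
    """
    deltas = [lt - ls for lt, ls in zip(teacher_logprobs, student_logprobs)]
    return sum(deltas) / len(deltas)


def divergence_aware_sample(candidates, k):
    """Keep the k candidates on which the student diverges most from the teacher.

    `candidates` is a list of (example, teacher_logprobs, student_logprobs)
    triples; returns the k examples with the highest divergence score.
    """
    scored = [(sequence_divergence(t, s), ex) for ex, t, s in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]
```

Under this reading, DAS concentrates the training budget on sequences the student has not yet absorbed, which is consistent with the reported DAS-over-RS gains.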
Architecture and notable design choices
The models produced under this approach use a Mixture-of-Experts (MoE) architecture with sparse expert routing. Reported memory and compute optimizations include ZeRO-3 and Liger kernels to reduce memory footprint. Notable design choices aim to support sequence-level distillation at scale while maintaining practical resource usage:
- Focus on sequence-level distribution coverage and emphasizing Teacher Sentences during sampling.
- Temperature scheduling to control the diversity and difficulty of teacher responses presented during training.
- Sparse expert routing to leverage MoE efficiency-quality trade-offs.
Specific per-layer or parameter specifications (e.g., number of layers, hidden size, attention heads) were not provided.
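The temperature schedule itself is not specified beyond the two reported temperatures (T=0.6 and T=1.0) and a cold-start variant. One plausible sketch: hold the low temperature during a cold-start phase, then ramp linearly to the high temperature. The `warmup_frac` knob and the linear ramp are assumptions for illustration:

```python
def sampling_temperature(step, total_steps, t_start=0.6, t_end=1.0, warmup_frac=0.3):
    """Temperature schedule sketch: hold t_start for a cold-start phase,
    then ramp linearly to t_end.

    t_start and t_end match the two temperatures reported in the data
    mixes (0.6 and 1.0); warmup_frac and the linear shape are assumed.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step <= warmup_steps:
        return t_start
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return t_start + min(1.0, frac) * (t_end - t_start)
```

This matches the reported motivation: low-temperature responses stabilize early learning, while later high-temperature responses broaden coverage of the teacher's output distribution.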
Training methodology, datasets, and hyperparameters
Training emphasized data efficiency and multi-domain coverage. The reported training footprint and configuration include the following exact numbers and mixes.
Data and sampling:
- Total training footprint of 448K (448,000) samples.
- Multi-domain mixtures referenced include combinations such as "25K Math + 10K Code + 10K Science + RS (T = 0.6)", "25K Math + 10K Code + 10K Science + RS (T = 1.0)", "50K Math + 20K Code + 20K Science + RS (T = 1.0)", and a cold-start variant "25K Math + 10K Code + 10K Science + RS (T = 1.0) w/ cold start (T = 0.6)".
- Sources documented: 105K mathematical reasoning questions from NVIDIA AceReason, code generation questions from the OpenCodeReasoning dataset, scientific reasoning from NVIDIA's OpenScience Reasoning dataset, and instruction-following questions from AM-DeepSeek-R1Distilled-1.4M.
- A total inventory of responses is noted as "Total of 105K low-temperature (T=0.6) and 330K high-temperature (T=1.0) responses."
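The reported mixtures are compact enough to express as configuration. A hypothetical encoding of one variant ("25K Math + 10K Code + 10K Science + RS (T = 1.0) w/ cold start (T = 0.6)"); the field names are assumptions, while the counts and temperatures come from the report:

```python
# Illustrative encoding of one reported data mixture; field names are
# assumed, counts and temperatures are as reported.
mixture = {
    "math": 25_000,
    "code": 10_000,
    "science": 10_000,
    "sampling": "RS",              # random-sampling baseline; "DAS" for divergence-aware
    "temperature": 1.0,            # main-phase sampling temperature
    "cold_start_temperature": 0.6, # cold-start variant only
}

# Domain-question total for this mixture (excludes the RS response pool).
total = sum(v for v in mixture.values() if isinstance(v, int))
```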
Training hyperparameters and compute:
- Initial learning rate of 5e-5 decaying to 1e-5 via a cosine scheduler.
- Cutoff length set to 64K.
- Global batch size of 64 over 6 epochs.
- Greedy sequence packing used to accelerate training.
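Greedy sequence packing is named but not specified. A common realization is first-fit-decreasing packing of tokenized sequences into bins bounded by the cutoff length; a sketch under that assumption, taking 64K as 65,536 tokens:

```python
def greedy_pack(lengths, cutoff=65_536):
    """First-fit-decreasing sequence packing (one plausible reading of
    "greedy sequence packing"): sort sequences by length, descending,
    and place each into the first bin with room, opening a new bin
    otherwise. Returns a list of bins, each a list of sequence lengths.
    """
    bins = []
    for length in sorted(lengths, reverse=True):
        if length > cutoff:
            raise ValueError("sequence exceeds cutoff length")
        for b in bins:
            if sum(b) + length <= cutoff:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins
```

Packing multiple sequences per 64K window keeps padding low, which is how this step accelerates training.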
Post-training fine-tuning and SFT:
- Supervised fine-tuning (SFT) was used.
- Reported SFT sample summaries include "50K math responses sampled at low temperature (T=0.6) and high temperature (T=1.0)" and "100K samples at high temperature (T=1.0)."
- No preference-alignment method or data summaries were provided.
Evaluation, benchmarks, and selected results
Evaluation emphasizes reasoning performance and comparisons across sampling strategies. Headline claims include state-of-the-art performance among models of comparable scale, top-tier reasoning capability, and outperforming several larger counterparts on key benchmarks.
Reported benchmark highlights (exact values preserved):
- AIME24: 88.5 (score), presented also as "DASD-4B-Thinking: 88.5" versus "AM-thinking-v1: 85.3".
- AIME25: 83.3 (score), presented also as "DASD-4B-Thinking: 83.3" versus "AM-thinking-v1: 74.4".
- LiveCodeBench v5: 69.3 (score) for DASD-4B-Thinking; comparators listed include DeepSeek-R1-0528-Qwen3-8B: 60.5, Qwen3-14B: 63.5, NVIDIA-OpenReasoning-Nemotron-7B: 63.9.
- LiveCodeBench v6: 67.5 for DASD-4B-Thinking versus Qwen3-4B-Thinking-2507: 55.2.
- GPQA-D / GPQA-Diamond: 68.4 for DASD-4B-Thinking; other reported figures include Qwen3-32B: 68.4 and NVIDIA-Nemotron-Ultra-253B: 76.0.
Performance gains associated with sampling and data scale:
- "+1.4 improvement with T=1.0 samples over T=0.6 samples" on AIME24.
- "+4.2 improvement with T=1.0 samples over T=0.6 samples" on AIME25.
- "+2.8 improvement with 100K T=1.0 samples over 50K T=1.0 samples" on AIME25.
- A reported "+4.1" gain on AIME24 and "+1.8" on AIME25 when comparing certain mixtures against "T=1.0 data alone".
- Controlled comparisons indicate that DAS consistently achieves higher test performance compared to Random Sampling (RS) across evaluated settings (explicit results include AIME24 and AIME25 accuracy deltas: e.g., "50K Math + RS (T = 0.6): 81.7" vs "50K Math + DAS (T = 0.6): 83.3"; "25K Math + RS: 79.0" vs "25K Math + DAS: 82.5").
Comparative claims:
- The approach reports outperforming many comparable-size models and in several cases surpassing larger models (including particular 32B-scale models) on reasoning benchmarks.
- Specific variant performance summaries include "Qwen3-4B-Instruct-2507: 47.4% to 74.0% (+26.6%)" on AIME25 and "DASD-30B-A3B-Thinking-Preview: 86.7% (+1.7%)" on AIME25, among other comparative figures on LCB v6 and GPQA-D.
Where it excels and where it struggles
Strengths:
- Demonstrates compact reasoning capability and data efficiency, achieving competitive results using only 448K training samples.
- Strong performance on mathematical, coding, and scientific reasoning benchmarks (notably AIME24 and AIME25).
- Effective at identifying and prioritizing teacher-generated sequences through DAS, producing consistent test improvements over random sampling.
- Shows superior efficiency-quality trade-offs at MoE scale, with sparse routing and memory optimizations aiding practical training.
Weaknesses and failure modes:
- Student models struggle to learn effectively from high-temperature (T=1.0) data, which is more diverse but harder to fit.
- Boosted or high-divergence Teacher Sentences may negatively correlate with test-set accuracy in some cases.
- The student model's capacity to absorb diverse teacher behaviors is a recognized bottleneck.
Limitations and open questions
- Student model capacity limits transfer: the ability of a student to absorb diverse teacher behaviors is explicitly noted as a bottleneck.
- Learning from high-temperature teacher outputs is challenging and can reduce effective learning despite increases in data diversity.
- Additional architectural or scale-specific details (exact parameter counts, per-layer sizes, tokenizer specifics, and context window sizes) were not provided, limiting reproducibility of exact configurations from the available information.
Notable figures and quotes
- Reported total training responses: "Total of 105K low-temperature (T=0.6) and 330K high-temperature (T=1.0) responses."
- Reported training footprint: "448K samples" (448,000).
- SFT and sample breakdowns cited: "50K math responses sampled at low temperature (T=0.6) and high temperature (T=1.0)" and "100K samples at high temperature (T=1.0)."
Reported attribution
- Organization: Alibaba Cloud Computing.
- Report/paper title: "Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning."