Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning

Overview

Distribution-Aligned Sequence Distillation (DASD) is a data-efficient sequence-level distillation pipeline developed by Alibaba Cloud Computing to enhance reasoning capabilities in compact models. The approach targets multi-domain reasoning tasks — including mathematics, code generation, scientific reasoning, and complex instruction-following — by addressing core distillation challenges: inadequate coverage of the teacher's sequence-level distribution, misalignment between teacher outputs and student capacity, and exposure bias from teacher-forced training versus autoregressive inference.

Primary model variants associated with the approach include DASD-4B-Thinking, Qwen3-4B-Instruct-2507, Qwen3-Next-80B-A3B-Thinking, DASD-30B-A3B-Thinking-Preview, and gpt-oss-120b. The underlying research is reported under the title "Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning."

Key components and contributions

The training pipeline emphasizes sequence-level alignment and selective sampling to transfer multi-step reasoning behavior from large teacher models into smaller student models while remaining data-efficient. Key contributions include:

  • Divergence-aware sampling (DAS) to prioritize examples rich in Teacher Sentences for effective knowledge transfer.
  • Temperature-scheduled learning to broaden coverage of the teacher's output distribution and stabilize early-stage learning.
  • Mixed-policy distillation, implemented as a lightweight stage to blend sampling policies and reduce exposure bias.
  • Rigorous response filtering and quality control, enabling strong performance with limited data.
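The mechanics of DAS are not fully specified above. A minimal sketch of one plausible reading, scoring each candidate example by the mean per-token KL divergence between teacher and student distributions and keeping the most divergent ones (the field names and the scoring rule are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors over the vocabulary."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def divergence_aware_sample(examples, k):
    """Hypothetical DAS step: keep the k examples whose teacher and
    student token distributions disagree most (mean per-token KL)."""
    scored = []
    for ex in examples:
        kls = [kl_divergence(t, s)
               for t, s in zip(ex["teacher_probs"], ex["student_probs"])]
        scored.append((float(np.mean(kls)), ex))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]
```

Under this reading, examples on which the student already matches the teacher contribute little signal and are deprioritized.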

Architecture and notable design choices

The models produced under this approach use a Mixture-of-Experts (MoE) architecture with sparse expert routing. Reported memory and compute optimizations include ZeRO-3 and Liger kernels, which reduce the memory footprint during training. Notable design choices aim to support sequence-level distillation at scale while maintaining practical resource usage:

  • Focus on sequence-level distribution coverage, emphasizing Teacher Sentences during sampling.
  • Temperature scheduling to control the diversity and difficulty of teacher responses presented during training.
  • Sparse expert routing to leverage MoE efficiency-quality trade-offs.
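The routing scheme itself is not detailed above. A generic top-k softmax gate of the kind common in sparse MoE layers (layer sizes and k are chosen for illustration, not taken from the report) can be sketched as:

```python
import numpy as np

def top_k_route(hidden, gate_weights, k=2):
    """Generic top-k MoE gate: score every expert, keep the k
    highest-scoring, and renormalize their softmax weights.
    Sizes and k here are illustrative, not from the report."""
    logits = hidden @ gate_weights           # (num_experts,)
    top = np.argsort(logits)[-k:]            # indices of the k best experts
    scores = np.exp(logits[top] - logits[top].max())
    weights = scores / scores.sum()          # gate weights over chosen experts
    return top, weights
```

Only the selected experts run a forward pass for this token, which is the source of the efficiency-quality trade-off mentioned above.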

Specific per-layer or parameter specifications (e.g., number of layers, hidden size, attention heads) were not provided.

Training methodology, datasets, and hyperparameters

Training emphasized data efficiency and multi-domain coverage. The reported training footprint and configuration are as follows.

Data and sampling:

  • Total training footprint of "448K samples" (448,000).
  • Multi-domain mixtures referenced include combinations such as "25K Math + 10K Code + 10K Science + RS (T = 0.6)", "25K Math + 10K Code + 10K Science + RS (T = 1.0)", "50K Math + 20K Code + 20K Science + RS (T = 1.0)", and a cold-start variant "25K Math + 10K Code + 10K Science + RS (T = 1.0) w/ cold start (T = 0.6)".
  • Sources documented: 105K mathematical reasoning questions from NVIDIA AceReason, code generation questions from the OpenCodeReasoning dataset, scientific reasoning from NVIDIA's OpenScience Reasoning dataset, and instruction-following questions from AM-DeepSeek-R1Distilled-1.4M.
  • A total inventory of responses is noted as "Total of 105K low-temperature (T=0.6) and 330K high-temperature (T=1.0) responses."
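The low-temperature (T=0.6) and high-temperature (T=1.0) response pools differ only in how the teacher's next-token logits are flattened before sampling. A standard temperature-sampling sketch (generic, not code from the report):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Scale logits by 1/T, softmax, and sample one token id.
    T < 1 (e.g. 0.6) sharpens the distribution toward the
    teacher's top choices; T = 1.0 leaves it unchanged."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                        # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    token = int(rng.choice(len(probs), p=probs))
    return token, probs
```

This is why the T=1.0 pool is more diverse but, as noted later, harder for the student to fit.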

Training hyperparameters and compute:

  • Initial learning rate of 5e-5 decaying to 1e-5 via a cosine scheduler.
  • Cutoff length set to 64K.
  • Global batch size of 64 over 6 epochs.
  • Greedy sequence packing used to accelerate training.
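The reported schedule (an initial rate of 5e-5 decaying to 1e-5 under a cosine scheduler) can be written out directly. The total step count below is illustrative, and any warmup phase is omitted because none is described:

```python
import math

def cosine_lr(step, total_steps, lr_max=5e-5, lr_min=1e-5):
    """Cosine decay from lr_max at step 0 to lr_min at the final
    step, matching the reported 5e-5 -> 1e-5 schedule."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```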

Post-training fine-tuning and SFT:

  • Supervised fine-tuning (SFT) was used.
  • Reported SFT sample summaries include "50K math responses sampled at low temperature (T=0.6) and high temperature (T=1.0)" and "100K samples at high temperature (T=1.0)."
  • No preference-alignment method or data summaries were provided.

Evaluation, benchmarks, and selected results

Evaluation emphasizes reasoning performance and comparisons across sampling strategies. Headline claims include state-of-the-art performance among models of comparable scale, top-tier reasoning capability, and outperforming several larger counterparts on key benchmarks.

Reported benchmark highlights (exact values preserved):

  • AIME24: 88.5 (score), presented also as "DASD-4B-Thinking: 88.5" versus "AM-thinking-v1: 85.3".
  • AIME25: 83.3 (score), presented also as "DASD-4B-Thinking: 83.3" versus "AM-thinking-v1: 74.4".
  • LiveCodeBench v5: 69.3 (score) for DASD-4B-Thinking; comparators listed include DeepSeek-R1-0528-Qwen3-8B: 60.5, Qwen3-14B: 63.5, NVIDIA-OpenReasoning-Nemotron-7B: 63.9.
  • LiveCodeBench v6: 67.5 for DASD-4B-Thinking versus Qwen3-4B-Thinking-2507: 55.2.
  • GPQA-D / GPQA-Diamond: 68.4 for DASD-4B-Thinking; other reported figures include Qwen3-32B: 68.4 and NVIDIA-Nemotron-Ultra-253B: 76.0.
  • Performance gains associated with sampling and data scale:
      • "+1.4 improvement with T=1.0 samples over T=0.6 samples" on AIME24.
      • "+4.2 improvement with T=1.0 samples over T=0.6 samples" on AIME25.
      • "+2.8 improvement with 100K T=1.0 samples over 50K T=1.0 samples" on AIME25.
      • A reported "+4.1" gain on AIME24 and "+1.8" on AIME25 when comparing certain mixtures against "T=1.0 data alone".
  • Controlled comparisons indicate that DAS consistently outperforms Random Sampling (RS) across evaluated settings; explicit AIME24/AIME25 accuracy deltas include "50K Math + RS (T = 0.6): 81.7" vs. "50K Math + DAS (T = 0.6): 83.3" and "25K Math + RS: 79.0" vs. "25K Math + DAS: 82.5".

Comparative claims:

  • The approach reports outperforming many comparable-size models and in several cases surpassing larger models (including particular 32B-scale models) on reasoning benchmarks.
  • Specific variant performance summaries include "Qwen3-4B-Instruct-2507: 47.4% to 74.0% (+26.6%)" on AIME25 and "DASD-30B-A3B-Thinking-Preview: 86.7% (+1.7%)" on AIME25, among other comparative figures on LCB v6 and GPQA-D.

Where it excels and where it struggles

Strengths:

  • Demonstrates compact reasoning capability and data efficiency, achieving competitive results using only 448K training samples.
  • Strong performance on mathematical, coding, and scientific reasoning benchmarks (notably AIME24 and AIME25).
  • Effective at identifying and prioritizing teacher-generated sequences through DAS, producing consistent test improvements over random sampling.
  • Shows superior efficiency-quality trade-offs at MoE scale, with sparse routing and memory optimizations aiding practical training.

Weaknesses and failure modes:

  • Student models have difficulty learning effectively from high-temperature (T=1.0) data, which is harder to fit.
  • Boosted or high-divergence Teacher Sentences may negatively correlate with test-set accuracy in some cases.
  • The student model's capacity to absorb diverse teacher behaviors is a recognized bottleneck.

Limitations and open questions

  • Student model capacity limits transfer: the ability of a student to absorb diverse teacher behaviors is explicitly noted as a bottleneck.
  • Learning from high-temperature teacher outputs is challenging and can reduce effective learning despite increases in data diversity.
  • Additional architectural or scale-specific details (exact parameter counts, per-layer sizes, tokenizer specifics, and context window sizes) were not provided, limiting reproducibility of exact configurations from the available information.

Notable figures and quotes

  • Reported total training responses: "Total of 105K low-temperature (T=0.6) and 330K high-temperature (T=1.0) responses."
  • Reported training footprint: "448K samples" (448,000).
  • SFT and sample breakdowns cited: "50K math responses sampled at low temperature (T=0.6) and high temperature (T=1.0)" and "100K samples at high temperature (T=1.0)."

Reported attribution

  • Organization: Alibaba Cloud Computing.
  • Report/paper title: "Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning."

Sources

https://arxiv.org/abs/2601.09088v1