Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning

Overview

Distribution-Aligned Sequence Distillation (DASD) is a data-efficient sequence-level distillation pipeline developed by Alibaba Cloud Computing to enhance reasoning capabilities in compact models. The approach targets multi-domain reasoning tasks — including mathematics, code generation, scientific reasoning, and complex instruction-following — by addressing core distillation challenges: inadequate coverage of the teacher's sequence-level distribution, misalignment between teacher outputs and student capacity, and exposure bias from teacher-forced training versus autoregressive inference.

Primary model variants associated with the approach include DASD-4B-Thinking, Qwen3-4B-Instruct-2507, Qwen3-Next-80B-A3B-Thinking, DASD-30B-A3B-Thinking-Preview, and gpt-oss-120b. The underlying research is reported under the title "Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning."

Key components and contributions

The training pipeline emphasizes sequence-level alignment and selective sampling to transfer multi-step reasoning behavior from large teacher models into smaller student models while remaining data-efficient. Key contributions include:

  • Divergence-aware sampling (DAS) to prioritize examples rich in Teacher Sentences for effective knowledge transfer.
  • Temperature-scheduled learning to broaden coverage of the teacher's output distribution and stabilize early-stage learning.
  • Mixed-policy distillation, implemented as a lightweight stage to blend sampling policies and reduce exposure bias.
  • Rigorous response filtering and quality control, enabling strong performance with limited data.
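The mechanics of DAS are not fully specified above. A minimal sketch of one plausible reading, scoring each candidate example by the mean per-token KL divergence between teacher and student distributions and keeping the most divergent ones (the field names and the scoring rule are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors over the vocabulary."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def divergence_aware_sample(examples, k):
    """Hypothetical DAS step: keep the k examples whose teacher and
    student token distributions disagree most (mean per-token KL)."""
    scored = []
    for ex in examples:
        kls = [kl_divergence(t, s)
               for t, s in zip(ex["teacher_probs"], ex["student_probs"])]
        scored.append((float(np.mean(kls)), ex))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]
```

Under this reading, examples on which the student already matches the teacher contribute little signal and are deprioritized.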

Architecture and notable design choices

The models produced under this approach use a Mixture-of-Experts (MoE) architecture with sparse expert routing. Reported memory and compute optimizations include ZeRO-3 and Liger kernels, which reduce the memory footprint during training. Notable design choices aim to support sequence-level distillation at scale while maintaining practical resource usage:

  • Focus on sequence-level distribution coverage, emphasizing Teacher Sentences during sampling.
  • Temperature scheduling to control the diversity and difficulty of teacher responses presented during training.
  • Sparse expert routing to leverage MoE efficiency-quality trade-offs.
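The routing scheme itself is not detailed above. A generic top-k softmax gate of the kind common in sparse MoE layers (layer sizes and k are chosen for illustration, not taken from the report) can be sketched as:

```python
import numpy as np

def top_k_route(hidden, gate_weights, k=2):
    """Generic top-k MoE gate: score every expert, keep the k
    highest-scoring, and renormalize their softmax weights.
    Sizes and k here are illustrative, not from the report."""
    logits = hidden @ gate_weights           # (num_experts,)
    top = np.argsort(logits)[-k:]            # indices of the k best experts
    scores = np.exp(logits[top] - logits[top].max())
    weights = scores / scores.sum()          # gate weights over chosen experts
    return top, weights
```

Only the selected experts run a forward pass for this token, which is the source of the efficiency-quality trade-off mentioned above.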

Specific per-layer or parameter specifications (e.g., number of layers, hidden size, attention heads) were not provided.

Training methodology, datasets, and hyperparameters

Training emphasized data efficiency and multi-domain coverage. The reported training footprint and configuration are as follows.

Data and sampling:

  • Total training footprint of "448K samples" (448,000).
  • Multi-domain mixtures referenced include combinations such as "25K Math + 10K Code + 10K Science + RS (T = 0.6)", "25K Math + 10K Code + 10K Science + RS (T = 1.0)", "50K Math + 20K Code + 20K Science + RS (T = 1.0)", and a cold-start variant "25K Math + 10K Code + 10K Science + RS (T = 1.0) w/ cold start (T = 0.6)".
  • Sources documented: 105K mathematical reasoning questions from NVIDIA AceReason, code generation questions from the OpenCodeReasoning dataset, scientific reasoning from NVIDIA's OpenScience Reasoning dataset, and instruction-following questions from AM-DeepSeek-R1Distilled-1.4M.
  • A total inventory of responses is noted as "Total of 105K low-temperature (T=0.6) and 330K high-temperature (T=1.0) responses."
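The low-temperature (T=0.6) and high-temperature (T=1.0) response pools differ only in how the teacher's next-token logits are flattened before sampling. A standard temperature-sampling sketch (generic, not code from the report):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Scale logits by 1/T, softmax, and sample one token id.
    T < 1 (e.g. 0.6) sharpens the distribution toward the
    teacher's top choices; T = 1.0 leaves it unchanged."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                        # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    token = int(rng.choice(len(probs), p=probs))
    return token, probs
```

This is why the T=1.0 pool is more diverse but, as noted later, harder for the student to fit.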

Training hyperparameters and compute:

  • Initial learning rate of 5e-5 decaying to 1e-5 via a cosine scheduler.
  • Cutoff length set to 64K.
  • Global batch size of 64 over 6 epochs.
  • Greedy sequence packing used to accelerate training.
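The reported schedule (an initial rate of 5e-5 decaying to 1e-5 under a cosine scheduler) can be written out directly. The total step count below is illustrative, and any warmup phase is omitted because none is described:

```python
import math

def cosine_lr(step, total_steps, lr_max=5e-5, lr_min=1e-5):
    """Cosine decay from lr_max at step 0 to lr_min at the final
    step, matching the reported 5e-5 -> 1e-5 schedule."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```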

Post-training fine-tuning and SFT:

  • Supervised fine-tuning (SFT) was used.
  • Reported SFT sample summaries include "50K math responses sampled at low temperature (T=0.6) and high temperature (T=1.0)" and "100K samples at high temperature (T=1.0)."
  • No preference-alignment method or data summaries were provided.

Evaluation, benchmarks, and selected results

Evaluation emphasizes reasoning performance and comparisons across sampling strategies. Headline claims include state-of-the-art performance among models of comparable scale, top-tier reasoning capability, and outperforming several larger counterparts on key benchmarks.

Reported benchmark highlights (exact values preserved):

  • AIME24: 88.5 (score), presented also as "DASD-4B-Thinking: 88.5" versus "AM-thinking-v1: 85.3".
  • AIME25: 83.3 (score), presented also as "DASD-4B-Thinking: 83.3" versus "AM-thinking-v1: 74.4".
  • LiveCodeBench v5: 69.3 (score) for DASD-4B-Thinking; comparators listed include DeepSeek-R1-0528-Qwen3-8B: 60.5, Qwen3-14B: 63.5, NVIDIA-OpenReasoning-Nemotron-7B: 63.9.
  • LiveCodeBench v6: 67.5 for DASD-4B-Thinking versus Qwen3-4B-Thinking-2507: 55.2.
  • GPQA-D / GPQA-Diamond: 68.4 for DASD-4B-Thinking; other reported figures include Qwen3-32B: 68.4 and NVIDIA-Nemotron-Ultra-253B: 76.0.
  • Performance gains associated with sampling and data scale:
      • "+1.4 improvement with T=1.0 samples over T=0.6 samples" on AIME24.
      • "+4.2 improvement with T=1.0 samples over T=0.6 samples" on AIME25.
      • "+2.8 improvement with 100K T=1.0 samples over 50K T=1.0 samples" on AIME25.
      • A reported "+4.1" gain on AIME24 and "+1.8" on AIME25 when comparing certain mixtures against "T=1.0 data alone".
  • Controlled comparisons indicate that DAS consistently outperforms Random Sampling (RS) across evaluated settings; explicit AIME24/AIME25 accuracy deltas include "50K Math + RS (T = 0.6): 81.7" vs. "50K Math + DAS (T = 0.6): 83.3" and "25K Math + RS: 79.0" vs. "25K Math + DAS: 82.5".

Comparative claims:

  • The approach reports outperforming many comparable-size models and in several cases surpassing larger models (including particular 32B-scale models) on reasoning benchmarks.
  • Specific variant performance summaries include "Qwen3-4B-Instruct-2507: 47.4% to 74.0% (+26.6%)" on AIME25 and "DASD-30B-A3B-Thinking-Preview: 86.7% (+1.7%)" on AIME25, among other comparative figures on LCB v6 and GPQA-D.

Where it excels and where it struggles

Strengths:

  • Demonstrates compact reasoning capability and data efficiency, achieving competitive results using only 448K training samples.
  • Strong performance on mathematical, coding, and scientific reasoning benchmarks (notably AIME24 and AIME25).
  • Effective at identifying and prioritizing teacher-generated sequences through DAS, producing consistent test improvements over random sampling.
  • Shows superior efficiency-quality trade-offs at MoE scale, with sparse routing and memory optimizations aiding practical training.

Weaknesses and failure modes:

  • Student models have difficulty learning effectively from high-temperature (T=1.0) data, which is harder to fit.
  • Boosted or high-divergence Teacher Sentences may negatively correlate with test-set accuracy in some cases.
  • The student model's capacity to absorb diverse teacher behaviors is a recognized bottleneck.

Limitations and open questions

  • Student model capacity limits transfer: the ability of a student to absorb diverse teacher behaviors is explicitly noted as a bottleneck.
  • Learning from high-temperature teacher outputs is challenging and can reduce effective learning despite increases in data diversity.
  • Additional architectural or scale-specific details (exact parameter counts, per-layer sizes, tokenizer specifics, and context window sizes) were not provided, limiting reproducibility of exact configurations from the available information.

Notable figures and quotes

  • Reported total training responses: "Total of 105K low-temperature (T=0.6) and 330K high-temperature (T=1.0) responses."
  • Reported training footprint: "448K samples" (448,000).
  • SFT and sample breakdowns cited: "50K math responses sampled at low temperature (T=0.6) and high temperature (T=1.0)" and "100K samples at high temperature (T=1.0)."

Reported attribution

  • Organization: Alibaba Cloud Computing.
  • Report/paper title: "Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning."

Sources

https://arxiv.org/abs/2601.09088v1