
DASD-4B-Thinking Model Documentation

Overview

The DASD-4B-Thinking model, together with the related DASD-30B-A3B-Thinking-Preview and Qwen3-4B-Instruct-2507 models, focuses on enhancing the reasoning capabilities of smaller models through advanced distillation techniques. It is designed to address specific challenges in reasoning tasks such as mathematics, code generation, and scientific reasoning.

Problem Statement

Challenges Addressed

  • Reasoning Enhancement: Improves reasoning capabilities in smaller models through effective distillation strategies.
  • Sequence-Level Distillation: Addresses inadequate coverage of the teacher's sequence-level distribution and misalignment between the teacher's output and the student's learning capacity.
  • Exposure Bias: Mitigates exposure bias resulting from teacher-forced training methods versus autoregressive inference during testing.

Limitations of Existing Methods

  • Existing methods often rely on heuristic rules for filtering supervised fine-tuning (SFT) data and do not adequately address core distillation principles.
  • They fail to cover the full support of the teacher's distribution and do not effectively manage misleading gradients during training.

Key Contributions

  • Enhanced Sequence-Level Distillation: Proposes a refined training pipeline that improves the interaction between teacher and student models.
  • Temperature-Scheduled Learning: Introduces a temperature-based approach to enhance learning efficiency by broadening coverage of the teacher's modes.
  • Divergence-Aware Sampling: Implements a sampling strategy that prioritizes high-quality teacher sentences, improving learning robustness.
  • Mixed-Policy Distillation: Combines on-policy and off-policy signals to mitigate exposure bias and improve overall model performance.
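The temperature-scheduled idea above can be sketched as a simple annealing function that moves the teacher's sampling temperature from low (high-confidence samples) toward high (broader mode coverage). The linear schedule and the parameter names below are illustrative assumptions, not the paper's exact recipe:

```python
def sampling_temperature(step: int, total_steps: int,
                         t_start: float = 0.3, t_end: float = 1.0) -> float:
    """Anneal the teacher's sampling temperature linearly from a low,
    high-confidence setting (t_start) toward a broader, higher-entropy
    one (t_end) as training progresses. Purely illustrative: the actual
    schedule shape and endpoints are not specified here."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return t_start + frac * (t_end - t_start)
```

Early steps would then draw low-temperature teacher samples that are easy for the student to fit, with higher-temperature samples mixed in later to widen coverage of the teacher's distribution.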

Training Methodology

Algorithm Overview

The training process consists of a two-stage pipeline:

  1. Initial Off-Policy SFT Phase: Involves training on teacher-generated responses.
  2. Fine-Tuning with Mixed-Policy Distillation: Incorporates diverse training examples to refine the model further.
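The two-stage structure can be sketched as a plain control loop. All of the callables below (`teacher_generate`, `student_generate`, `teacher_correct`, `train_step`) are hypothetical stand-ins for the real training components, used only to show the order of operations:

```python
from typing import Callable, List

def two_stage_distill(
    prompts: List[str],
    teacher_generate: Callable[[str], str],   # hypothetical: teacher response
    student_generate: Callable[[str], str],   # hypothetical: student draft
    teacher_correct: Callable[[str, str], str],  # hypothetical: teacher fixes draft
    train_step: Callable[[str, str], None],   # hypothetical: one SFT update
) -> None:
    """Sketch of the two-stage pipeline: off-policy SFT first, then
    mixed-policy distillation that combines student drafts corrected
    by the teacher (on-policy-derived) with fresh teacher responses
    (off-policy)."""
    # Stage 1: off-policy SFT on teacher-generated responses.
    for p in prompts:
        train_step(p, teacher_generate(p))
    # Stage 2: mixed-policy fine-tuning.
    for p in prompts:
        draft = student_generate(p)
        corrected = teacher_correct(p, draft)
        train_step(p, corrected)             # on-policy-derived signal
        train_step(p, teacher_generate(p))   # off-policy signal
```

Stage 2 is where exposure bias is addressed: the student sees supervision on its own generations, not only on teacher-forced prefixes.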

Techniques Employed

  • Temperature-Scheduled Learning: Trains the student model on low-temperature, high-confidence samples before integrating higher-temperature samples.
  • Divergence-Aware Sampling (DAS): Identifies and prioritizes teacher sentences based on output probabilities, facilitating effective knowledge transfer.
  • Mixed-Policy Distillation: Re-generates responses using the student model while prompting the teacher for corrections, addressing exposure bias.
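Divergence-aware sampling can be illustrated with a minimal sketch that ranks teacher sentences by the mean per-token divergence between teacher and student next-token distributions, so distillation focuses where the student deviates most from the teacher. The use of KL divergence and the dictionary-based distributions are assumptions made for illustration; the paper's actual scoring criterion may differ:

```python
import math
from typing import Dict, List

def kl_divergence(p: Dict[str, float], q: Dict[str, float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) over a shared next-token vocabulary, with a small
    epsilon to guard against zero student probabilities."""
    return sum(pv * math.log(pv / max(q.get(tok, 0.0), eps))
               for tok, pv in p.items() if pv > 0.0)

def rank_sentences_by_divergence(
    teacher_dists: List[List[Dict[str, float]]],
    student_dists: List[List[Dict[str, float]]],
) -> List[int]:
    """Return sentence indices ordered by mean per-token teacher/student
    KL divergence, highest first. Illustrative sketch only."""
    scores = []
    for i, (t_seq, s_seq) in enumerate(zip(teacher_dists, student_dists)):
        per_tok = [kl_divergence(t, s) for t, s in zip(t_seq, s_seq)]
        scores.append((sum(per_tok) / max(len(per_tok), 1), i))
    return [i for _, i in sorted(scores, reverse=True)]
```

A curriculum could then upweight (or filter) sentences at either end of this ranking, depending on whether the goal is to stress-test the student or to keep teacher targets within its current capacity.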

Evaluation and Performance

Benchmarking

The model has demonstrated state-of-the-art performance across various benchmarks, including:

  • AIME24: Achieved scores up to 88.5.
  • AIME25: Scored as high as 83.3.
  • LiveCodeBench: Scores of 69.3 and 67.5 on different versions of the benchmark.

Comparative Analysis

  • Outperforms several larger models (e.g., 32B-scale) while using far fewer parameters.
  • Consistently achieves higher test performance than random-sampling baselines and comparable models.

Limitations and Future Directions

  • The model's effectiveness is constrained by the student's capacity to absorb diverse teacher behaviors, which may become a bottleneck in performance.
  • Further exploration is needed to enhance the model's adaptability to a wider range of reasoning tasks.

Conclusion

The DASD-4B-Thinking model represents a significant advancement in the field of model distillation, particularly for smaller models. Its innovative techniques for enhancing reasoning capabilities and mitigating common training issues position it as a competitive option for various applications in mathematics, code generation, and scientific reasoning. Future work will focus on overcoming existing limitations and expanding the model's applicability across diverse domains.

Sources

https://arxiv.org/abs/2601.09088v1