MiniMax-M1 Model Documentation

Overview

Model Name: MiniMax-M1
Base Model: MiniMax-Text-01 (extended with reinforcement learning via the CISPO algorithm)
Context Length Support: Up to 1 million tokens
Key Features:
  • Hybrid Mixture-of-Experts (MoE) architecture
  • Lightning Attention mechanism for efficient long-context processing
  • CISPO algorithm for enhanced reinforcement learning (RL) efficiency

Problem Addressed

MiniMax-M1 is designed to tackle challenges in long-context reasoning tasks, particularly in:

  • Efficiently scaling test-time compute for processing extensive inputs.
  • Enhancing reasoning capabilities while preserving the contribution of every token during RL updates, rather than discarding tokens clipped by standard policy-gradient objectives.
  • Rewarding open-ended tasks that lack ground-truth answers or rule-based correctness checks, such as instruction following and creative writing.

Limitations of Existing Methods

Traditional transformer architectures rely on softmax attention, whose computational cost grows quadratically with sequence length, making extended reasoning expensive. Earlier RL recipes have not been fully validated on large-scale reasoning models, and they struggle to reward open-ended queries that lack ground-truth answers.

Key Contributions

  • First open-weight, large-scale hybrid-attention reasoning model: Integrates MoE with a lightning attention mechanism.
  • CISPO Algorithm: Introduces a novel approach to optimize RL training by clipping importance sampling weights.
  • Data Quality Optimization: Trained on 7.5 trillion tokens with a focus on high-quality datasets for mathematical, logical, and software engineering tasks.
  • Continuous Monitoring: Implements a pairwise comparison framework and human-annotated benchmarks for evaluating model responses.

Training and Evaluation

Training Pipeline

  1. Continual Pretraining: Utilizes 7.5 trillion tokens.
  2. Supervised Fine-Tuning (SFT): Focuses on reflection-based chain-of-thought reasoning.
  3. Reinforcement Learning (RL) Training: Incorporates rule-based rewards and gradually mixes in general domain tasks.
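
As a rough illustration of the final stage, the sketch below mixes rule-checked (verifiable) tasks with reward-model-scored general-domain tasks on an increasing schedule; the ramp, pool names, and exact-match reward are assumptions for illustration, not MiniMax-M1's actual configuration.

```python
import random

def sample_task(step: int, total_steps: int, verifiable_pool, general_pool):
    """Gradually raise the probability of drawing a general-domain task
    (scored by a reward model) instead of a rule-checked one."""
    general_share = min(0.5, step / total_steps)  # assumed ramp, capped at 50%
    pool = general_pool if random.random() < general_share else verifiable_pool
    return random.choice(pool)

def rule_based_reward(response: str, reference: str) -> float:
    """Binary correctness reward for tasks with a checkable answer."""
    return 1.0 if response.strip() == reference.strip() else 0.0
```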

Evaluation Metrics

  • Achieves strong performance on reasoning benchmarks, scoring 86.0% on AIME 2024 with MiniMax-M1-80k, and outperforms other open-weight models such as DeepSeek-R1 and Qwen3-235B-A22B on complex software engineering, tool-use, and long-context tasks.
  • Demonstrates large reductions in inference compute, consuming roughly 25% of DeepSeek-R1's FLOPs at a generation length of 100K tokens.

Core Techniques

CISPO

  • Purpose: Enhance RL efficiency and stabilize training.
  • Functionality: Clips importance sampling weights instead of token updates, leveraging all tokens for gradient computations.
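
A minimal PyTorch sketch of a CISPO-style loss is shown below. Tensor shapes, the clipping range, and the advantage estimates are assumptions for illustration; the core idea is the one described above: clip and stop-gradient the importance-sampling weight so that every token still contributes a policy-gradient term.

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, mask, eps_low=0.2, eps_high=0.2):
    """CISPO-style objective: clip the importance-sampling weight, not the update.

    logp_new, logp_old: [batch, seq] per-token log-probs under the current and
    behavior policies; advantages, mask: [batch, seq] (mask marks response tokens).
    """
    # Token-level importance-sampling weight r_t = pi_theta / pi_old.
    ratio = torch.exp(logp_new - logp_old)
    # Clip the weight and stop its gradient, so every token (including those a
    # PPO-style clip would zero out) still contributes a log-prob gradient.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
    per_token = clipped * advantages * logp_new
    # Average over response tokens; negate because optimizers minimize.
    return -(per_token * mask).sum() / mask.sum().clamp(min=1.0)
```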

Lightning Attention

  • Purpose: Efficiently processes long-context inputs.
  • Functionality: Implements an I/O-aware linear attention variant to reduce computational costs.
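
The sketch below shows the blockwise recurrence that underlies causal linear attention, which lightning attention implements with I/O-aware GPU kernels; the production kernels tile the computation for on-chip memory and include decay terms omitted here. Shapes and the block size are illustrative assumptions.

```python
import torch

def blockwise_linear_attention(q, k, v, block_size=64):
    """Causal linear attention computed block by block.

    q, k, v: [batch, seq, dim]. Returns [batch, seq, dim].
    """
    batch, seq, dim = q.shape
    kv_state = torch.zeros(batch, dim, dim, dtype=q.dtype, device=q.device)
    out = torch.empty_like(v)
    for start in range(0, seq, block_size):
        end = min(start + block_size, seq)
        qb, kb, vb = q[:, start:end], k[:, start:end], v[:, start:end]
        # Inter-block term: attend to all earlier blocks through the running
        # K^T V state (constant memory, linear time in sequence length).
        inter = qb @ kv_state
        # Intra-block term: ordinary causal attention restricted to the block.
        m = end - start
        causal = torch.ones(m, m, device=q.device).tril().bool()
        scores = qb @ kb.transpose(1, 2)
        intra = scores.masked_fill(~causal, 0.0) @ vb
        out[:, start:end] = inter + intra
        # Fold this block's keys and values into the running state.
        kv_state = kv_state + kb.transpose(1, 2) @ vb
    return out
```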

Dynamic Sampling and Length Penalty

  • Dynamic sampling keeps RL training focused on prompts that still provide a useful learning signal, while a length penalty discourages responses that overrun the generation budget; a hedged sketch of both ideas follows.
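
The sketch below shows one common form of each technique, assuming a DAPO-style filter (drop prompts whose rollouts all receive the same reward) and a linear reward scale-down past a soft length limit; the thresholds and scaling shape are illustrative assumptions rather than MiniMax-M1's exact settings.

```python
from typing import List

def keep_prompt(rollout_rewards: List[float]) -> bool:
    """Dynamic sampling: keep a prompt only if its rollouts disagree, since
    all-correct or all-wrong groups give zero advantage and no gradient."""
    return len(set(rollout_rewards)) > 1

def length_penalized_reward(reward: float, length: int,
                            soft_limit: int = 30_000,
                            hard_limit: int = 40_000) -> float:
    """Linearly scale the reward toward zero between soft_limit and hard_limit."""
    if length <= soft_limit:
        return reward
    if length >= hard_limit:
        return 0.0
    return reward * (hard_limit - length) / (hard_limit - soft_limit)
```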

Early Truncation via Repetition Detection

  • Prevents instability from repetitive responses by halting generation when excessive token repetition is detected.
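
A minimal sketch of one possible detection rule follows: stop decoding once a long run of consecutive tokens is generated with very high probability, a typical signature of degenerate repetition. The probability threshold and run length are illustrative assumptions.

```python
from typing import List

def should_truncate(token_probs: List[float],
                    prob_threshold: float = 0.99,
                    run_length: int = 3000) -> bool:
    """Halt generation if the last `run_length` tokens were all emitted with
    probability above `prob_threshold` (a sign of a degenerate loop)."""
    if len(token_probs) < run_length:
        return False
    return all(p > prob_threshold for p in token_probs[-run_length:])
```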

Practical Considerations

Hyperparameters

  • Model Size: 456 billion total parameters (45.9 billion activated per token), 32 experts.
  • Learning Rate: Starts at 8e-5, decaying to 8e-6 over 5 trillion tokens.
  • Thinking Budgets: 40K and 80K tokens.
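
For reference, a tiny sketch of the quoted learning-rate decay, assuming a linear schedule over tokens (the true decay shape is not stated here):

```python
def lr_at(tokens_seen: float, total_tokens: float = 5e12,
          lr_start: float = 8e-5, lr_end: float = 8e-6) -> float:
    """Learning rate after `tokens_seen` tokens, assuming linear decay."""
    frac = min(tokens_seen / total_tokens, 1.0)
    return lr_start + frac * (lr_end - lr_start)
```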

Stability Measures

  • Employs early stopping and context-length management to keep generations from exhausting the available context window.

Common Challenges

  • Gradient explosion and non-convergence during training.
  • Potential pattern collapse leading to incoherent outputs.

Computational Requirements

  • Hardware: Full RL training ran on 512 H800 GPUs.
  • Cost: Approximately $534,700 in GPU rental for the complete RL run, finished within three weeks.
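
A back-of-the-envelope consistency check, assuming the full 512-GPU fleet runs continuously for the three weeks (the implied rental rate is an inference, not a reported figure):

```python
gpus, weeks, total_cost = 512, 3, 534_700
gpu_hours = gpus * weeks * 7 * 24        # = 258,048 GPU-hours
print(round(total_cost / gpu_hours, 2))  # ~2.07 USD per H800 GPU-hour implied
```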

Performance Insights

Strengths

  • Excels in complex software engineering and long-context tasks, achieving top scores in relevant benchmarks.
  • Outperforms all other open-weight models in long-context understanding.

Weaknesses

  • Lags behind models such as DeepSeek-R1-0528 on mathematical reasoning and coding-competition benchmarks.

Conclusion

MiniMax-M1 represents a significant advancement in long-context reasoning models, leveraging innovative techniques to enhance efficiency and performance in complex reasoning tasks. Its unique architecture and training methodologies position it as a leading solution in the landscape of open-weight models.

Sources

https://arxiv.org/abs/2506.13585v1