MiniMax-M1 — Scalable Long-Context Reasoning Model
Overview
MiniMax-M1 is built on the MiniMax-Text-01 foundation model and trained with the CISPO reinforcement-learning algorithm (the names MiniMax-Text-01 and CISPO sometimes appear alongside M1 in related contexts, but they denote the base model and the RL method, not aliases). M1 is positioned to enable efficient scaling of test-time compute for complex tasks that require processing very long inputs and generating extended outputs. The design emphasizes enhanced reasoning and long-context capabilities, supporting inputs of up to 1 million tokens and generation lengths of up to 80K tokens. The approach aims to avoid dropping tokens during updates, maintain entropy for stable exploration, and handle tasks without ground truth, such as instruction following and creative writing.
Key innovations and contributions
- A hybrid Mixture-of-Experts (MoE) architecture, combined with the Lightning attention mechanism and the CISPO reinforcement-learning algorithm for efficient fine-tuning.
Architecture and design choices
The architecture is a hybrid Mixture-of-Experts (MoE) design with a hybrid attention layout built around the Lightning attention mechanism: a transformer block with softmax attention follows every seven transnormer blocks that use Lightning attention. Context-length scaling is applied smoothly across four stages, starting from a 32K context window and extending to 1M tokens. The hybrid architecture of M1 is claimed to support near-linear scaling for longer sequences.
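The 7:1 interleaving described above can be sketched as a simple layer-stacking rule. This is a structural sketch only; the block names are illustrative placeholders, not the actual implementation:

```python
# Structural sketch of the hybrid attention layout: one softmax-attention
# transformer block after every seven lightning-attention (transnormer) blocks.
# Block names are illustrative placeholders, not the real implementation.

def hybrid_layout(num_blocks: int) -> list:
    """Return the attention type used by each transformer block, in order."""
    layout = []
    for i in range(num_blocks):
        # Every 8th block (positions 8, 16, ...) uses full softmax attention;
        # the seven preceding blocks use linear-cost lightning attention.
        if (i + 1) % 8 == 0:
            layout.append("softmax")
        else:
            layout.append("lightning")
    return layout

print(hybrid_layout(16))
```

For a 16-block stack this yields two softmax-attention blocks, keeping the quadratic-cost attention a small constant fraction of the depth, which is what enables the claimed near-linear scaling with sequence length.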
Additional architectural and data-management choices include:
- Token-wise masking for hyperparameter tuning and hybrid attention intended to support efficient RL scaling.
- Use of rule-based final correctness and format rewards when evaluating and optimizing instruction-following behavior.
- Embedding-based deduplication across RL data sources and strict separation from the supervised fine-tuning (SFT) dataset to avoid overlap.
- N-gram and embedding-based methods to eliminate contamination from benchmarks.
- Reformulation of multiple-choice questions into open-ended formats for training and evaluation.
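The n-gram contamination filtering mentioned above can be sketched as a set-intersection check between training samples and benchmark items. The n-gram size and the word-level tokenization here are illustrative assumptions, not the reported settings:

```python
# Minimal sketch of n-gram contamination filtering: drop a training sample
# if it shares any word-level n-gram with a benchmark item. The value of n
# and the tokenization are illustrative assumptions.

def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(sample: str, benchmark_texts: list, n: int = 8) -> bool:
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(b, n) for b in benchmark_texts)

benchmark = ["what is the sum of the first ten positive even integers"]
clean = "explain how attention works in transformer models step by step"
leaked = "compute: what is the sum of the first ten positive even integers please"
print(is_contaminated(clean, benchmark), is_contaminated(leaked, benchmark))
```

Embedding-based deduplication works analogously but compares dense vector similarity instead of exact n-gram overlap, catching paraphrased duplicates that an n-gram check misses.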
Variants and capacity-related specifications are described as follows: the model family includes MiniMax-M1-40K and MiniMax-M1-80K variants, named for their maximum generation lengths (40K and 80K tokens, respectively). Reported parameter-level specs include 456 billion total parameters, 45.9 billion parameters activated per token, and 32 experts. Context support is reported as up to 1 million tokens, while the output-length limit was extended from 40K to 80K tokens for the 80K variant.
Training and fine-tuning regimen
Pretraining and continued training details:
- Total continued training tokens reported: 7.5T tokens.
- Data mixture focused on a reasoning-intensive corpus, including natural question–answer pairs and semantic deduplication applied to QA data.
- Objectives and losses include decreasing the coefficient of the MoE auxiliary loss and using a CISPO objective with clipped importance sampling (IS) weight.
- Optimizer and schedule: AdamW with a constant learning rate of 8e-5 for 2.5T tokens, then a decay schedule over 5T tokens down to 8e-6.
- Important hyperparameters: β1 = 0.9, β2 = 0.95, ε = 1e-15.
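The reported schedule (constant 8e-5 for 2.5T tokens, then a decay over 5T tokens down to 8e-6) can be sketched as a function of tokens seen. The geometric decay shape below is an assumption; the source states only the endpoints and durations:

```python
# Sketch of the reported learning-rate schedule: constant 8e-5 for the first
# 2.5T tokens, then decaying over the next 5T tokens down to 8e-6.
# The geometric (exponential) decay shape is an assumption.

def learning_rate(tokens_seen: float) -> float:
    CONST_TOKENS = 2.5e12   # constant phase: 2.5T tokens at 8e-5
    DECAY_TOKENS = 5.0e12   # decay phase: 5T tokens, ending at 8e-6
    LR_HIGH, LR_LOW = 8e-5, 8e-6
    if tokens_seen <= CONST_TOKENS:
        return LR_HIGH
    frac = min((tokens_seen - CONST_TOKENS) / DECAY_TOKENS, 1.0)
    # geometric interpolation between the two reported endpoints
    return LR_HIGH * (LR_LOW / LR_HIGH) ** frac

print(learning_rate(1e12), learning_rate(7.5e12))
```

At 7.5T tokens (the reported total of continued training) the schedule has fully decayed to 8e-6.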
Compute and run-time:
- The full RL run is reported to have been completed within three weeks on 512 H800 GPUs, at a reported rental cost of $534,700.
- Training context extension to 1M tokens was applied during the runs.
Post-training:
- Supervised fine-tuning (SFT) was used. SFT data emphasized specific reasoning patterns and injected long chain-of-thought responses covering math, coding, STEM, writing, QA, and multi-turn chat.
- The SFT objective was to instill reflection-based Chain-of-Thought reasoning behaviors.
- Preference-alignment details were not provided.
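Conceptually, injecting long chain-of-thought responses into SFT data means pairing each prompt with a target that carries an explicit reasoning trace (including self-checks) before the final answer. The `<think>` tag convention and the helper below are illustrative assumptions; the source does not specify the serialization format:

```python
# Illustrative shape of a reflection-style CoT SFT sample. The <think> tag
# convention is an assumption for illustration only.

def make_sft_sample(prompt: str, reasoning: str, answer: str) -> dict:
    target = f"<think>\n{reasoning}\n</think>\n{answer}"
    return {"prompt": prompt, "target": target}

sample = make_sft_sample(
    "What is 17 * 24?",
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. Double-check: "
    "408 / 24 = 17, so the product is consistent.",
    "408",
)
print(sample["target"])
```

Training on such targets teaches the model to emit the reflective reasoning trace itself, which is the "reflection-based Chain-of-Thought" behavior the SFT objective aims to instill.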
Key innovations and methodological highlights
- Hybrid Mixture-of-Experts (MoE) architecture to enable large-parameter capacity with conditional computation.
- Lightning attention to reduce the cost of long-range attention and enable longer context windows.
- CISPO, a novel reinforcement learning algorithm introduced to improve training efficiency and stability in the RL stage; CISPO is reported to use clipped importance sampling weights and to introduce a slight bias in the gradient due to weight clipping.
- A curated, reasoning-heavy data mixture and operational controls (deduplication, dataset separation, n-gram/embedding filtering) intended to reduce benchmark contamination and reinforce reasoning capabilities.
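The CISPO idea above (clipping the importance-sampling weight and treating it as a constant, rather than dropping tokens as in PPO/GRPO-style clipping) can be sketched as a per-token loss. The clipping bounds are illustrative, and a real implementation would apply stop-gradient to the clipped weight inside an autodiff framework:

```python
# Sketch of a CISPO-style loss over one batch of tokens. The clipped IS
# weight acts as a fixed coefficient (stop-gradient in a real autodiff
# setting), so every token contributes a gradient and none are dropped.
# The clipping is the source of the slight gradient bias noted in the text.
# eps_low / eps_high values here are illustrative.

import math

def cispo_loss(logp_new, logp_old, advantages, eps_low=1.0, eps_high=0.28):
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                 # IS weight r_t
        r_clip = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        # r_clip is held constant; the gradient flows only through lp_new,
        # so even tokens with large ratios keep a (clipped) learning signal.
        total += r_clip * adv * lp_new
    return -total / len(logp_new)                         # negate to minimize

loss = cispo_loss([-0.5, -1.2, -0.1], [-0.6, -1.0, -0.9], [1.0, -0.5, 2.0])
print(round(loss, 4))
```

In standard PPO-style objectives, tokens whose ratio leaves the clip range receive zero gradient; here they keep a bounded contribution, which is the token-preservation property the overview emphasizes.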
Evaluation and benchmark performance
Headline results indicate that CISPO significantly outperforms DAPO and GRPO with the same number of training steps and that the model family ranks among the world's best open-weight models by the reported metrics. Evaluations highlight strong performance on long-context understanding, tool use, and complex software engineering tasks.
Representative benchmark scores and outcomes (selected and reported verbatim) include:
| Benchmark | MiniMax-M1-40K | MiniMax-M1-80K | Notes |
|---|---|---|---|
| AIME 2024 | 83.3 | 86.0 | |
| AIME 2025 | 74.6 | 76.9 | |
| MATH-500 | 96.0 | 96.8 | |
| LiveCodeBench | 62.3 | 65.0 | pass rate said to match Qwen3-235B |
| FullStackBench | 67.6 | 68.3 | reported to outperform Qwen3-235B |
| GPQA Diamond | 69.2 | 70.0 | |
| HLE (no tools) | 7.2 | 8.4 | |
| ZebraLogic | 80.1 | 86.8 | |
| MMLU-Pro | 80.6 | 81.1 | |
| SWE-bench Verified | 55.6 | 56.0 | DeepSeek-R1-0528 listed at 57.6 |
| OpenAI-MRCR (128K) | 76.1 | 73.4 | |
| OpenAI-MRCR (1M) | 58.6 | 56.2 | |
| LongBench-v2 | 61.0 | 61.5 | |
| TAU-bench (airline) | 60.0 | 62.0 | |
| TAU-bench (retail) | 67.8 | 63.5 | |
| SimpleQA | 17.9 | 18.5 | |
| MultiChallenge | 44.7 | 44.7 | |
Where the approach is reported to excel:
- Complex software engineering tasks.
- Tool utilization and agentic tool use (TAU-Bench agentic tool use reported to outperform Gemini 2.5 Pro).
- Long-context tasks and long-context understanding (reported to surpass OpenAI o3 and Claude 4 Opus on long-context measures).
- Superior training efficiency when compared to other RL fine-tuning approaches such as DAPO and GRPO (CISPO reportedly attains comparable performance to DAPO with 50% of the training steps).
Where it is reported to be weaker:
- Mathematical and coding competitions compared to DeepSeek-R1.
- Susceptibility to length bias and pattern collapse during RL training.
- Some benchmarks such as SimpleQA show lower factuality compared to DeepSeek-R1 while outperforming other open-weight models.
Limitations and open questions
Documented limitations and caveats include:
- A precision mismatch between training and inference kernels that initially stalled reward growth during RL training.
- Excessively aggressive extensions of training length can lead to gradient explosion.
- Reported issues with GRPO adversely affecting training performance.
- The gradient of the CISPO objective is slightly biased due to weight clipping.
- Length bias may misguide RL policy optimization, and pattern collapse during RL training remains a concern.
Notable dataset and capacity figures
A selection of notable numbers and dataset facts reported:
- Curated dataset of nearly 50K high-quality mathematical samples for RL training.
- Synthesis of approximately 53K logical reasoning samples for RL training.
- Generation of 30K competitive programming data samples for RL training.
- Construction of verifiable reinforcement learning environments using real-world data from public GitHub repositories.
- Output length limit extended from 40K tokens to 80K tokens.
- Average response lengths on AIME and LiveCodeBench reported to exceed 20,000 tokens.
- Reported AIME 2024 accuracy gains described as "substantial gains from 68% to 80%".
- Support for inputs of up to 1M tokens and generation lengths of 80K tokens.