GLM-4.5: Architecture, Training, Variants, and Evaluation
Overview
GLM-4.5 is an open-source model family developed by Zhipu AI in collaboration with Tsinghua University, positioned to advance agentic abilities, complex reasoning, and coding performance. The design emphasizes improved logical deduction, structured problem-solving, and verifiable accuracy, while also targeting efficient reinforcement-learning (RL) training and flexible training paradigms for agentic tasks. Key claimed contributions include a hybrid reasoning mode, a Mixture-of-Experts design, and a suite of training and evaluation techniques aimed at strong performance across reasoning, code, translation, and long-context understanding.
Architecture and Notable Design Choices
The architecture is based on a Mixture-of-Experts (MoE) approach with several targeted design decisions to enhance reasoning capacity and long-context modeling.
Core architectural features
The full GLM-4.5 configuration uses a tall-and-narrow design (reduced width, increased depth) and integrates MoE layers for capacity scaling. Attention uses an unusually high head count together with grouped-query attention; the added heads are reported to improve reasoning performance, while grouped queries reduce KV-cache cost and raise throughput. The design also includes mechanisms to stabilize attention logits and to support speculative decoding.
Key design elements include:
- Grouped-Query Attention and partial RoPE to improve attention efficiency and long-context handling.
- QK-Norm to stabilize attention logits.
- Sigmoid gating and loss-free balance routing for MoE layers.
- An additional MoE layer used as an MTP (Multi-Token Prediction) layer to enable speculative decoding.
- Maximum sequence length evolution during training: pre-training at 4,096 tokens and mid-training extended to 32,768–131,072 tokens; RoPE base frequency adjusted from 10,000 to 1,000,000 to support long contexts.
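To make the partial-RoPE and base-frequency points concrete, here is a minimal sketch of rotary embeddings applied to only a fraction of each head's channels. The 50% rotary fraction, function names, and pure-Python layout are illustrative assumptions, not GLM-4.5's implementation.

```python
import math

def rope_frequencies(head_dim, rotary_frac=0.5, base=1_000_000.0):
    """Inverse frequencies for the rotated slice of each head.

    Partial RoPE rotates only the first `rotary_dim` channels; the rest
    pass through unchanged. Raising `base` (e.g. 10_000 -> 1_000_000)
    slows the rotation, which supports longer contexts.
    """
    rotary_dim = int(head_dim * rotary_frac)
    return [base ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]

def apply_partial_rope(vec, pos, freqs):
    """Rotate the first 2 * len(freqs) channels of `vec` by position `pos`."""
    out = list(vec)
    for i, f in enumerate(freqs):
        theta = pos * f
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out
```

At position 0 the rotation is the identity, and channels beyond the rotary slice are never touched, which is the defining property of the partial variant.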
Training and system-level design choices
The model supports both synchronous and asynchronous training, with a decoupled design for the training and rollout engines. The RL framework uses GRPO (Group Relative Policy Optimization) with the KL loss term removed, and introduces dynamic sampling-temperature adjustment to maintain a balance between exploration and accuracy. Training also employs an iterative self-distillation approach and a multi-source feedback system combining rule-based rewards, human feedback (RLHF), and model-based feedback (RLAIF). Both colocated and disaggregated training modes are supported.
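A minimal sketch of the group-relative advantage computation (no value network and, as stated above, no KL term), together with a toy dynamic-temperature rule; the target, step size, and bounds in `adjust_temperature` are illustrative assumptions, not the paper's values.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and std of its own group. No learned value function is needed,
    and no KL penalty is applied here."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

def adjust_temperature(temp, mean_reward, target=0.5, step=0.05,
                       lo=0.6, hi=1.2):
    """Toy dynamic-temperature rule: heat up when the group is failing
    (to explore more), cool down when it is already accurate."""
    temp += step if mean_reward < target else -step
    return min(hi, max(lo, temp))
```

Because advantages are centered within each group, they sum to zero per prompt, so rollouts are compared only against their siblings.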
Variants and Parameterization
Three named variants are described across the project: GLM-4.5, GLM-4.5-Air, and GLM-4.5-Base. Available specification details are summarized for each variant where provided.
GLM-4.5 (largest configuration)
GLM-4.5 is presented as the largest configuration in the family. Reported architecture and capacity details:
- Total Parameters: 355B
- Activated Parameters: 32B
- Dense Layers: 3
- MoE Layers: 89
- MTP Layers: 1
- Hidden Dimension: 5120
- Dense Intermediate Dimension: 12288
- MoE Intermediate Dimension: 1536
- Attention Head Dimension: 128
- Attention Heads: 96
- Key-Value Heads: 8
- Experts (total): 160
- Experts Active Per Token: 8
- Shared Experts: 1
- QK-Norm: Yes
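The expert counts above (160 routed experts, 8 active per token, 1 always-on shared expert) can be illustrated with a toy sigmoid-gated top-k router. This is a sketch only: the loss-free balance mechanism (typically an online-adjusted per-expert bias) is omitted, and the function names are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def route_token(logits, top_k=8):
    """Sigmoid-gated top-k routing: score every routed expert with a
    sigmoid (rather than a softmax over all experts), keep the top_k by
    score, and renormalize their weights. Shared experts are always
    active and bypass this selection entirely."""
    scores = [sigmoid(l) for l in logits]
    chosen = sorted(range(len(scores)),
                    key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}
```

Usage: for a 160-expert layer, `route_token` returns 8 expert indices with weights summing to 1; the single shared expert's output is added unconditionally.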
GLM-4.5-Air (smaller MoE variant)
GLM-4.5-Air is described as a smaller, more compact MoE variant with the following reported specs:
- Total Parameters: 106B
- Activated Parameters: 12B
- Dense Layers: 1
- MoE Layers: 45
- MTP Layers: 1
- Hidden Dimension: 4096
- Dense Intermediate Dimension: 10944
- MoE Intermediate Dimension: 1408
- Attention Head Dimension: 128
- Attention Heads: 96
- Key-Value Heads: 8
- Experts (total): 128
- Experts Active Per Token: 8
- Shared Experts: 1
- QK-Norm: No
GLM-4.5-Base
The GLM-4.5-Base variant is the pre-trained base model underlying the family, providing broad textual capability and language coverage; its training data explicitly spans English, Chinese, code, and mathematics.
Training Data, Objectives, and Hyperparameters
Pre-training and post-training (fine-tuning/SFT/RL) workflows include multiple stages and specialized objectives.
Pre-training summary:
- Total training tokens: 23T
- Data mixture: Documents from webpages, social media, books, papers, and code repositories; majority are English and Chinese webpages; multilingual documents included from crawled webpages and Fineweb-2; source code from GitHub and other hosting platforms; mathematical and scientific documents from webpages, books, and papers.
- Objective for source code: Fill-In-the-Middle objective.
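The Fill-In-the-Middle objective rearranges a code file so a left-to-right model learns to infill a missing span. A minimal sketch in the common prefix-suffix-middle (PSM) layout; the sentinel token strings are placeholders, since GLM-4.5's actual special tokens are not given here.

```python
def to_fim(code, span_start, span_end,
           pre_tok="<|fim_prefix|>", mid_tok="<|fim_middle|>",
           suf_tok="<|fim_suffix|>"):
    """Split a file into (prefix, middle, suffix) and emit it in PSM
    order: the model conditions on prefix and suffix, then predicts the
    middle with an ordinary next-token loss."""
    prefix = code[:span_start]
    middle = code[span_start:span_end]
    suffix = code[span_end:]
    return f"{pre_tok}{prefix}{suf_tok}{suffix}{mid_tok}{middle}"
```

At inference time, generation after the middle sentinel fills the hole, which is how FIM-trained models power in-editor code completion.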
Optimizer and schedule:
- Muon optimizer used for all parameters except word embeddings, biases, and RMSNorm weights.
- Cosine decay learning rate schedule.
- Learning rate warm-up from 0 to 2.5e-4, decaying to 2.5e-5.
- Group Relative Policy Optimization (GRPO) for RL training.
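The warm-up and cosine-decay schedule above can be sketched as follows; only the 2.5e-4 peak and 2.5e-5 floor come from the text, while the warm-up and total step counts are placeholders.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak=2.5e-4, floor=2.5e-5):
    """Linear warm-up from 0 to `peak`, then cosine decay to `floor`."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    t = (step - warmup_steps) / (total_steps - warmup_steps)  # in [0, 1]
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))
```

The schedule peaks exactly at the end of warm-up and lands on the floor at the final step, matching the 2.5e-4 to 2.5e-5 range described above.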
Important hyperparameters and training practices:
- Newton-Schulz iteration steps N: 5
- Momentum µ: 0.95
- Scaled Muon update RMS: 0.2
- Weight decay ratio: 0.1
- Batch size warmup from 16M tokens to 64M tokens
- Sampling temperature (dynamically adjusted during RL)
- Maximum sequence length progression as described in architecture
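The Newton-Schulz iteration (N = 5 above) is the core of the Muon optimizer: it approximately maps the momentum/update matrix to the nearest semi-orthogonal matrix before applying it. A sketch using the quintic coefficients from the public Muon reference implementation, which may differ from the paper's exact values.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize an update matrix with a quintic
    Newton-Schulz iteration, as Muon does. The matrix is first scaled
    by its Frobenius norm so the iteration converges; singular values
    are driven toward 1 rather than made exactly 1."""
    a, b, c = 3.4445, -4.7750, 2.0315  # reference-implementation coefficients
    x = g / (np.linalg.norm(g) + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:          # iterate on the wide orientation for efficiency
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```

Note the iteration is applied per weight matrix, which is why (per the list above) Muon is used for all parameters except embeddings, biases, and RMSNorm weights, which are vectors rather than matrices.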
Post-training and SFT:
- Supervised Fine-Tuning (SFT) applied at the beginning of both Stage 1 and Stage 2.
- SFT data: a small set with extended Chain-of-Thought (CoT) responses, plus millions of samples covering reasoning tasks, general chat, agentic tasks, and long-context understanding tasks.
- Objectives for SFT: Provide a cold start for chat, reasoning, and tool use, and distill capabilities from different experts into a hybrid reasoning generalist.
Evaluation and Benchmark Performance
GLM-4.5 was evaluated across a broad set of benchmarks spanning reasoning, code, translation, long-context, and agentic tasks. The model claims strong comparative performance on many benchmarks and competitive results versus both open- and closed-source models.
Headline outcomes:
- Claimed the highest overall score among compared models across English, Chinese, and other-language evaluations.
- Pairwise comparisons: vs Claude Sonnet 4 — 40.4% win, 9.6% tie, 50.0% loss; vs Kimi K2 — 53.9% win, 17.3% tie, 28.8% loss; vs Qwen3-Coder — 80.8% win, 7.7% tie, 11.5% loss.
- Tool calling reliability: 90.6% success rate.
Selected benchmark highlights and reported numbers:
- AIME 24: Score 91.0 (GLM-4.5)
- TAU-Bench: 70.1 overall (GLM-4.5); Retail subset 79.7, Airline subset 60.4
- MMLU: Score 90.0 (GLM-4.5); MMLU-Pro accuracy GLM-4.5: 84.6, GLM-4.5-Air: 81.4
- LiveCodeBench: Score 72.9; LiveCodeBench-Base Pass@1: GLM-4.5: 28.1
- GSM8K: EM 79.4
- MATH: EM 61.0
- BBH: EM 86.2
- HellaSwag: EM 87.1
- PIQA: EM 85.3
- TriviaQA: EM 80.0
- EvalPlus: Pass@1 78.1
- CC-Bench: Strong task-completion performance relative to open-source baselines; competitive against closed-source models
- SafetyBench: Safety score 89.87 (GLM-4.5)
- Humanity's Last Exam (HLE): Accuracy 14.4
- Novel Logical Reasoning Problems: Expert evaluation score 62.0
Cross-benchmark comparisons noted:
- GLM-4.5 reportedly outperforms several strong baselines on targeted benchmarks (for example, outperforming Claude Opus 4 on average and exceeding certain models on coding and SWE-bench Verified). Reported relative rankings include overall 3rd place among evaluated models and 2nd on agentic benchmarks.
Where GLM-4.5 Excels and Where It Is Weaker
Strengths: GLM-4.5 is positioned to excel in agentic execution, code tasks, mathematical reasoning, objective QA, long-context understanding, and multilingual text generation. The system claims notable strengths in tool calling reliability, task completion consistency, and performance improvements from RL training on web search and software-engineering tasks. The two-stage difficulty-based curriculum and token-weighted mean loss for code RL are reported to produce faster convergence and improved performance.
Weaknesses and comparative gaps: Reported areas of relative weakness include fairness and bias concerns, as well as lower scores on highly adversarial benchmarks such as Humanity's Last Exam (HLE). On some metrics, GLM-4.5 is only roughly on par with competitor models such as DeepSeek-R1-0528.
Limitations and Open Questions
- Character escaping in function-call templates presents a challenge for agentic foundation models and was explicitly called out.
- Introducing RL stages with shorter maximum lengths can cause the model to unlearn long-context capabilities, indicating trade-offs when changing sequence-length regimes during RL fine-tuning.
- The training approach intentionally balances synchronous and asynchronous modes and proposes dynamic adjustments, but the trade-offs between throughput and long-horizon rollout performance remain central considerations.
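The character-escaping hazard in function-call templates can be illustrated with a small example: naively splicing an argument into a JSON-like template breaks as soon as the argument contains quotes or newlines, whereas a proper serializer escapes them. The template shape and function names below are illustrative, not GLM-4.5's actual format.

```python
import json

def render_tool_call(name, arguments):
    """Serialize a tool call safely: json.dumps escapes quotes,
    backslashes, and newlines inside string arguments, so the call can
    be parsed back losslessly. A naive f-string template like
    '{"code": "<arg>"}' would produce invalid JSON for the same input."""
    return json.dumps({"name": name, "arguments": arguments})

# An argument containing quotes and a newline survives round-tripping:
tricky = 'print("hello")\n'
call = render_tool_call("execute_python", {"code": tricky})
assert json.loads(call)["arguments"]["code"] == tricky
```

Code-heavy arguments are exactly where agentic models fail most often on escaping, which is why the text singles this out as a challenge for agentic foundation models.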
Key Quantitative Highlights
- 355B total parameters (largest configuration)
- 32B activated parameters (GLM-4.5)
- 106B total parameters for GLM-4.5-Air
- Tool calling reliability: 90.6%
- AIME 24 accuracy: 91.0