Apertus — Model and Training Overview

Summary and goals

Apertus is a family of fully open language models developed by Project Apertus and affiliated groups in the Swiss AI Initiative. Primary released variants include Apertus-8B and Apertus-70B, with instruction-tuned releases Apertus-8B-Instruct and Apertus-70B-Instruct. The project emphasizes data compliance, transparency, and broad multilingual coverage: training on 15T tokens across over 1800 languages with approximately 40% of pretraining data allocated to non-English content. Apertus targets general-purpose language modelling, multilingual representation (including support for Romansh), mitigation of memorization, stable large-scale training, and verifiable data practices.

Architecture and model design

Apertus models are predominantly dense decoder-only Transformer architectures, with experiments and references to mixture-of-experts variants. Reported core specifications at the two main scale points:

  • Apertus-8B: 32 layers, hidden size 4096, MLP size 21504, 32 attention heads, 8 KV heads.
  • Apertus-70B: 80 layers, hidden size 8192, MLP size 43008, 64 attention heads, 8 KV heads.
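The 8 KV heads shared among 32 or 64 attention heads imply grouped-query attention, in which each group of query heads attends through a single shared key/value head. A single-position numpy sketch of this sharing (the shapes and softmax scaling are standard conventions, not Apertus-specific details):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads: int):
    """q: (n_heads, d); k, v: (n_kv_heads, seq, d). Each group of
    n_heads // n_kv_heads query heads shares one KV head."""
    n_heads, d = q.shape
    group = n_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                        # index of the shared KV head
        scores = k[kv] @ q[h] / np.sqrt(d)     # (seq,) attention logits
        w = np.exp(scores - scores.max())      # numerically stable softmax
        w /= w.sum()
        out[h] = w @ v[kv]                     # weighted sum of shared values
    return out

# Apertus-8B-style ratio: 32 query heads sharing 8 KV heads (groups of 4).
rng = np.random.default_rng(0)
q = rng.normal(size=(32, 128))
k = rng.normal(size=(8, 16, 128))
v = rng.normal(size=(8, 16, 128))
out = grouped_query_attention(q, k, v, n_kv_heads=8)
```

Sharing KV heads in this way shrinks the KV cache by the group factor (4× here) at inference time, which is the usual motivation for GQA.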

Several notable design choices and training-stability interventions are integrated into the architecture and training recipe. Key named components are xIELU (an activation function), AdEMAMix (an optimizer variant), and QK-Norm for training stability. Additional choices include grouped-query attention (GQA), Rotary Positional Embeddings with base Θ = 500,000 and NTK-aware RoPE scaling, RMSNorm, untied embeddings and output weights, prevention of cross-document attention via masking, and use of special BoD/EoD tokens. A Goldfish-style objective replaces or augments standard cross-entropy in parts of training to reduce verbatim memorization.
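The rotary embedding with base Θ = 500,000 and NTK-aware scaling can be sketched as follows; the NTK base-adjustment exponent used here is the common formulation, and the exact factor used by Apertus is not specified in this overview:

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float = 500_000.0,
                     ntk_factor: float = 1.0) -> np.ndarray:
    """Per-pair rotary frequencies. NTK-aware scaling stretches the
    base (rather than the positions) to extend usable context."""
    adjusted_base = base * ntk_factor ** (head_dim / (head_dim - 2))
    exponents = np.arange(0, head_dim, 2) / head_dim
    return adjusted_base ** -exponents

def apply_rope(x: np.ndarray, pos: int, freqs: np.ndarray) -> np.ndarray:
    """Rotate consecutive (even, odd) channel pairs of one head vector
    by position-dependent angles."""
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

freqs = rope_frequencies(head_dim=128)   # e.g. 4096 hidden / 32 heads = 128
q = apply_rope(np.ones(128), pos=10, freqs=freqs)
```

Because each pair is a pure rotation, RoPE preserves vector norms, and relative position enters attention only through angle differences.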

Tokenizer, prompt format, and system messaging

The models use a byte-level BPE tokenizer with a vocabulary of 131,072 tokens ("131k"). Prompt formatting employs a structured chat template with special tokens to delineate user and system prompts. System prompt content includes persona examples (from PersonaHub) and a summary of Swiss AI Charter principles, together with an explicit summary of model identity, origin, and capabilities.
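A hedged illustration of such a chat template; the special-token names below are placeholders, not Apertus's actual tokens:

```python
def render_chat(system: str, turns: list[tuple[str, str]]) -> str:
    """Flatten a conversation into one string with role-delimiting
    special tokens (hypothetical names, not the real Apertus tokens)."""
    parts = [f"<|system|>{system}<|end|>"]
    for role, text in turns:
        parts.append(f"<|{role}|>{text}<|end|>")
    parts.append("<|assistant|>")  # generation continues from here
    return "".join(parts)

prompt = render_chat(
    "You are Apertus, a helpful multilingual assistant.",
    [("user", "Translate 'Grüezi' to Romansh.")],
)
```

The key design property is that the template ends with the assistant's role marker, so the model's next tokens are unambiguously the reply.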

Pretraining data, objectives, and data curation

Pretraining is reported as a multi-stage curriculum trained on approximately 15 trillion (15T) tokens. A fraction of tokens—reported as "0.3T masked due to Goldfish Loss"—is specifically affected by the memorization-mitigation objective. Stagewise mixtures include curated subsets such as FineWeb-Edu, FineWeb-2-HQ, CommonCrawl subsets targeted to math content ("FineMath"), StarCoder, DCLM-Edu, Clean Wikipedia, Europarl, translation parallel data, institutional books, and specialized corpora (Tulu3, OLMo2). For a later stage, a high-level mixture of 70% Stage-5 data, 20% FineWeb-Long, and 10% Institutional Books is reported.

Filtering and compliance are central: pretraining data are filtered for copyrighted materials, retroactive author opt-outs, toxic content, and personally identifiable information (PII), and robots.txt exclusions are respected. The FineWeb-2 corpus is described as the largest openly available multilingual web-crawl dataset; FineWeb-2-HQ is a high-quality subset for 20 high-resource languages.

Language coverage claims include training on over "1800" / "1811" languages, and post-training interaction support across 149 languages. The non-English fraction of pretraining data is reported as "40%" or "~40%".

The primary pretraining objective highlighted is the Goldfish Loss / Goldfish objective, framed as a Goldfish-style approach to curb verbatim reproduction of training text and alter memorization dynamics.
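A minimal sketch of a hashed Goldfish-style mask, assuming the published formulation (hash the preceding h-token context and deterministically drop roughly one in k tokens from the loss); the specific hash function here is illustrative:

```python
import hashlib

def goldfish_mask(tokens: list[int], k: int = 50, h: int = 50) -> list[bool]:
    """True = token contributes to the loss; False = masked out.
    A token is dropped when a hash of its h-token prefix lands in one
    particular bucket out of k, so ~1/k of tokens are excluded, and the
    same context always yields the same decision (deterministic)."""
    mask = []
    for i in range(len(tokens)):
        context = tokens[max(0, i - h):i]
        digest = hashlib.sha256(str(context).encode("utf8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % k
        mask.append(bucket != 0)   # drop iff bucket == 0 (~2% for k = 50)
    return mask

mask = goldfish_mask(list(range(10_000)))
drop_rate = 1 - sum(mask) / len(mask)   # near 1/50 = 2%
```

Because the decision depends only on the local context hash, repeated copies of the same passage mask the same tokens, which is what prevents the model from ever receiving gradient on them—and also why near-duplicates (which hash differently) can undermine the scheme, as noted in the limitations.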

Training recipe, hyperparameters, and compute

The training recipe emphasizes stability and scaling:

  • Optimizers and schedules: AdEMAMix optimizer combined with a Warmup-Stable-Decay (WSD) learning-rate schedule; cosine schedules and a 1-sqrt decay over 100B tokens are also referenced in some experiments. Reported optimizer hyperparameters for QRPO include βKL = 5, β3 = 0.99, α = 8.0, β1 = 0.9, β2 = 0.999. Reported learning rates include 5 × 10^-7 for the 8B model and 1 × 10^-7 for the 70B model, with 5 × 10^-6 and 2 × 10^-6 reported in other stages or ablations.

  • Important hyperparameters and training practices: the xIELU activation in MLPs; Goldfish-style masking with a 2% token drop rate (k = 50, i.e., one in fifty tokens) and a 50-token hashing context window (h = 50); initial batch sizes of 1024 for the 8B model and 2048 for the 70B model (with global batch sizes of 504, 512, and 1,024 reported in specific runs); a sequence length of 4,096 tokens for many stages; and gradient-norm clipping of 0.1 in some runs.

  • Compute and efficiency: training used up to 4096 GPUs; reported compute includes "6 million GPU hours used", "6.74×10^24 FLOPs for training the 70B model", and an estimated 90 days of training on 4096 GPUs with an estimated 5 GWh power usage for the pretraining run. Throughput figures include ~6150 tokens/GPU/s for Apertus-8B and ~780 tokens/GPU/s for Apertus-70B; a claim of "26% higher throughput with FP8 training before rollback" is noted. A single NaN loss instance in Apertus-70B due to hardware failure is reported.
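The Warmup-Stable-Decay schedule above can be sketched as follows; the linear warmup and 1-sqrt decay shapes follow the common WSD formulation, and the phase boundaries in the example are assumptions, not Apertus's actual settings:

```python
def wsd_lr(step: int, peak_lr: float, warmup: int,
           stable_end: int, total: int) -> float:
    """Warmup-Stable-Decay: linear warmup, constant plateau, then a
    1-sqrt decay to zero over the final (total - stable_end) steps."""
    if step < warmup:                       # linear warmup
        return peak_lr * step / warmup
    if step < stable_end:                   # constant plateau
        return peak_lr
    frac = (step - stable_end) / (total - stable_end)
    return peak_lr * (1 - frac ** 0.5)      # 1-sqrt decay

# Example: warm up for 1k steps, hold, decay over the last 20% of training.
lrs = [wsd_lr(s, 3e-4, 1_000, 80_000, 100_000) for s in range(100_000)]
```

The appeal of WSD over a cosine schedule is that the plateau can be extended indefinitely and the decay re-run from any checkpoint, which suits multi-stage curricula like the one described above.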

Post-training supervised finetuning and alignment

A supervised finetuning (SFT) phase adapted outputs to structured conversational formats using curated prompt-completion pairs, with a data mixture covering "149 languages for user interaction" and "approximately 3.8 million examples from diverse sources."

Preference alignment was implemented. The pipeline incorporates Quantile Reward Policy Optimization (QRPO) as a primary preference optimization algorithm alongside references to direct preference approaches such as DPO. Reported alignment data includes 380,537 non-controversial prompts and 72,698 controversial prompts, with instruction-following, reasoning, and QA tasks included. Integration of reward-model scores and human preference rankings is reported; length normalization was applied and reported to improve downstream performance for QRPO and DPO, with QRPO outperforming DPO in the 70B model in the reported experiments.
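For reference, the direct-preference side of such a pipeline—the standard DPO objective, not Apertus's QRPO—reduces to a logistic loss over log-probability ratios against a frozen reference model; a numpy sketch:

```python
import numpy as np

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO: -log sigmoid(beta * margin), where the margin is
    how much more the policy prefers the chosen response than the
    reference model does. Inputs are summed log-probs of full responses."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return float(-np.log(1.0 / (1.0 + np.exp(-beta * margin))))

# Policy prefers the chosen answer more than the reference does, so the
# margin is positive and the loss drops below log(2).
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0)
```

Length normalization, as reported above, divides these summed log-probs by response length so that longer completions are not systematically favored or penalized.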

Evaluation highlights and benchmarking

Headline claims state that Apertus achieves state-of-the-art predictive quality among fully open models and strong performance across knowledge, cultural, and instruction-following evaluations. Selected reported results follow.

General knowledge and reasoning:

  • MMLU accuracy: Apertus-8B: 56.9, Apertus-70B: 58.9.
  • Global MMLU accuracy: Apertus-8B: 61.6, Apertus-70B: 65.2.
  • INCLUDE V1 accuracy: Apertus-8B: 55.3, Apertus-70B: 58.2.

Instruction following and coding:

  • HumanEval Pass@10: Apertus-70B-Instruct: 73.0 (Apertus-8B-Instruct not reported for Pass@10).
  • MBPP Pass@1: Apertus-70B-Instruct: 47.0, Apertus-8B-Instruct: 36.2.

Mathematics:

  • GSM8K: Apertus-70B-Instruct: 77.6, Apertus-8B-Instruct: 62.9.

Multilingual and cultural benchmarks:

  • SwitzerlandQA accuracy: Apertus-8B: 62.1, Apertus-70B: 60.2.
  • Cultural Bench accuracy: Apertus-8B: 37.3, Apertus-70B: 38.5.
  • Romansh WMT24++ BLEU: Apertus-8B-Instruct DE→RM 23.0, RM→DE 41.3; Apertus-70B-Instruct DE→RM 27.8, RM→DE 44.7, compared to Llama-3.3-70B-Instruct DE→RM 21.6, RM→DE 35.6.

Throughput and loss ablations:

  • Throughput: ∼6150 tokens/GPU/s for Apertus-8B, ∼780 tokens/GPU/s for Apertus-70B.
  • Training loss ablations show improvements attributed to design choices: baseline 1.5B loss ∼2.037; replacing SwiGLU with xIELU and other changes reduced loss to 1.997; switching AdamW→AdEMAMix reduced loss to 2.002 in ablation runs. Comparative CE loss after first 20k steps versus OLMo2 shows Apertus models with lower early loss in reported comparisons (e.g., Apertus 1B ∼2.75 vs OLMo2 1B ∼2.84).

Reported strengths include improved stability and gradient norms, fewer tokens needed to match baseline losses in smaller models, strong multilingual and cultural-knowledge performance among fully open models, and improved low-resource translation compared to select baselines.

Reported weaknesses include math and coding performance below the very top closed or open-weight models that receive additional RL training; average performance reductions due to license filtering (such as a reported 5.8% average performance loss); a smaller-than-typical scaling gap between the 8B and 70B variants; and text degeneration under greedy decoding in generation evaluations.

Key innovations

  • Goldfish Loss for selective token masking and memorization mitigation.
  • xIELU activation function used in MLP sublayers.
  • AdEMAMix optimizer and a Warmup-Stable-Decay learning-rate schedule.
  • QK-Norm and grouped-query attention to improve training stability and efficiency.
  • Adoption of QRPO (Quantile Reward Policy Optimization) and length-normalized reward optimization in alignment.

Limitations and caveats

Models may still produce hallucinations, toxic outputs, and other unsafe behaviors. Apertus is a language-only family and does not process non-text modalities. The tokenizer vocabulary size (131k) is larger than some comparators (OLMo2 at 100k) and is reported as a potential driver of higher average cross-entropy loss. Filtering and license-based exclusions are reported to have introduced estimated token-losses of approximately 8% in English data and 4% in multilingual data, with license filtering reducing average downstream performance in some ablations. The Goldfish Loss is reported as fragile to near-duplicates. The family has not undergone additional reinforcement-learning post-training variants (e.g., RLVR) that are noted to enhance math and coding capabilities. Safety and security evaluation work is reported to focus heavily on English and may generalize poorly across other languages. The open-weight nature of the release introduces a risk that post-training modifications could revert safety guardrails.

Notable reported numbers and quotes

Training and resource figures include "15T tokens" of pretraining, "0.3T masked due to Goldfish Loss", "6 million GPU hours used", "6.74×10^24 FLOPs for training the 70B model", an estimated "90 days" on 4096 GPUs, and "5 GWh" estimated power usage. Reported training-loss landmarks: baseline ∼2.037; with xIELU 1.997; with AdEMAMix 2.002. A claim of "26% higher throughput with FP8 training before rollback" is reported. Community-engagement statistics from the Swiss AI Charter survey include "71.8% of responses indicate that the chatbot should always or definitely follow the principle" and "91.3% agreement for Article 4 on Preventing Harm". The evaluation suite covered "94 different languages" in total.
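The reported FLOPs figure is broadly consistent with the common 6·N·D approximation (roughly 6 FLOPs per parameter per training token); a quick sanity check, treating 6ND as a rule of thumb rather than the paper's own accounting:

```python
# 6*N*D approximation: ~6 FLOPs per parameter per training token.
params = 70e9                            # Apertus-70B parameter count
tokens = 15e12                           # 15T pretraining tokens
approx_flops = 6 * params * tokens       # ~6.3e24
reported_flops = 6.74e24                 # figure reported for the 70B run
ratio = reported_flops / approx_flops    # within roughly 10% of the estimate
```

The small excess over 6ND is unsurprising, since the approximation ignores attention FLOPs and any restarted or ablation compute.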

Baselines and comparators

Apertus comparisons cited a range of open and open-weight model families including LLaMA, Qwen, OLMo, EuroLLM, and Gemma across parameter sizes from approximately 3B to 72B. Many reported benchmarks compare Apertus variants to these families and to other instruct-style models such as Llama-3.1/3.3 and Qwen variants.

Remaining limitations and open questions are reported explicitly in the paper's evaluation and ablation sections.

Sources

https://arxiv.org/abs/2509.14233v2