Apertus AI Model Documentation

Overview

Apertus is a family of large language models (LLMs) designed to address key challenges in multilingual representation, data compliance, and general-purpose language processing. The models include variants such as Apertus-8B and Apertus-70B, which are pre-trained on openly available data and emphasize safety and compliance with dataset licenses.

Key Features and Contributions

  • Multilingual Capabilities: Supports roughly 45 languages, including low-resource ones, with comprehensive support for Romansh, Switzerland's fourth national language.
  • Data Compliance: Ensures compliance with dataset licenses, retroactively respects crawling permissions, and filters out non-permissive and toxic content.
  • Memorization Mitigation: Implements the Goldfish objective to suppress verbatim recall of training data, reducing risks associated with copyright and privacy violations.
  • Training Innovations: Utilizes advanced techniques such as the xIELU activation function, AdEMAMix optimizer, QK-Norm, and Pre-Norm to enhance model performance and training stability.
  • Long Context Support: Capable of processing sequences of up to 65,536 tokens, enabling better handling of complex queries and tasks.
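The Goldfish objective mentioned above works by excluding a pseudorandom subset of token positions from the next-token loss, with the dropped positions determined by a hash of the local context so the same text always drops the same positions; that way the model never trains on every token of a given passage and cannot reproduce it verbatim. A minimal sketch of the masking step follows; the hash function, drop rate `k`, and context width `h` here are illustrative choices, not the values used for Apertus.

```python
import hashlib

def goldfish_mask(tokens, k=4, h=13):
    """Goldfish-style loss mask: drop roughly 1/k of token positions
    from the next-token loss. Each position's fate is decided by hashing
    the preceding h tokens, so the decision is deterministic for a given
    text (k and h are hypothetical values for illustration)."""
    mask = []
    for i in range(len(tokens)):
        ctx = tokens[max(0, i - h):i]
        digest = hashlib.sha256(repr(ctx).encode()).digest()
        # Keep the position unless its hash lands in the dropped bucket.
        mask.append(digest[0] % k != 0)
    return mask

tokens = list(range(32))
mask = goldfish_mask(tokens)
dropped = mask.count(False)
```

Because the mask is a deterministic function of the text itself, repeated occurrences of the same passage in the corpus drop the same tokens, which is what prevents the model from ever seeing the full verbatim sequence.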
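QK-Norm, listed among the training innovations, normalizes the query and key vectors before their dot product, which bounds the attention logits and guards against the logit blow-ups that destabilize large-scale training. A minimal single-head sketch with RMS normalization (learnable gains and scaling factors omitted for brevity; this is not the Apertus implementation):

```python
import math

def rms_norm(v, eps=1e-6):
    """Scale a vector to unit RMS (learnable gain omitted)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def qk_norm_score(q, k):
    """Attention logit with QK-Norm: normalize query and key before the
    dot product, bounding |logit| by the head dimension d."""
    qn, kn = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(qn, kn))

# A large-magnitude query/key pair no longer produces an exploding logit:
score = qk_norm_score([100.0, -50.0, 25.0, 10.0], [80.0, 60.0, -40.0, 5.0])
```

After normalization each vector has norm sqrt(d), so by Cauchy-Schwarz the logit magnitude is at most d regardless of the raw activations' scale.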

Problem Statement

Apertus addresses several shortcomings in existing LLMs:

  • Lack of reproducible data pipelines and disregard for content-owner rights.
  • Insufficient support for lower-resource languages and limited focus on multilinguality.
  • Existing models may produce hallucinations and unsafe outputs, failing to align with user expectations regarding helpfulness and safety.

Training and Feedback Mechanisms

  • Training Pipeline: The model undergoes a multi-stage training process, including pretraining, long-context training phases, and supervised fine-tuning (SFT).
  • Feedback Types: Utilizes both absolute reward signals and relative preferences to optimize model performance.
  • Evaluation: Extensive evaluation across multilingual benchmarks, including RULER and specific tasks like coding and mathematical reasoning.

Relationship to Other Models

Apertus builds upon and improves methods from several foundational models, including:

  • GPT-J and NVIDIA's Megatron-LM: Adopting decoder-only architectural choices in the style of GPT-J and distributed training infrastructure from the Megatron-LM framework.
  • Direct Preference Optimization (DPO) and Reinforcement Learning with KL Regularization: Addressing limitations of these methods through QRPO, which optimizes absolute reward signals rather than only pairwise preferences.

Evaluation and Performance

Apertus models have shown strong performance across various benchmarks:

  • General Language Understanding: Achieves state-of-the-art results in multilingual benchmarks.
  • Cultural Knowledge: Leads among fully open models in cultural knowledge assessments.
  • Robustness: Maintains near-baseline verbatim memorization and high lexical diversity throughout training, indicating stable optimization.

Benchmark Results

  • Apertus-70B-Instruct:
      • HumanEval (Pass@10): 73.0
      • GSM8K: 77.6
  • Apertus-8B:
      • General Language Understanding: 65.8
      • HellaSwag: 70.6

Limitations and Open Questions

Despite its strengths, Apertus has limitations:

  • Models may still produce hallucinations and toxic outputs.
  • Current implementations are focused on language-only tasks and do not handle multimodal inputs.
  • Further RL training is needed to enhance alignment and safety measures.

Conclusion

Apertus represents a significant advancement in the development of open, multilingual LLMs, addressing critical gaps in data compliance, multilingual representation, and safety. Its innovative training techniques and robust evaluation results position it as a leading model in the field of AI language processing.

Sources

https://arxiv.org/abs/2509.14233v2