Apertus AI Model Documentation
Overview
Apertus is a family of large language models (LLMs) designed to address key challenges in multilingual representation, data compliance, and general-purpose language processing. The models include variants such as Apertus-8B and Apertus-70B, which are pre-trained on openly available data and emphasize safety and compliance with dataset licenses.
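For orientation, here is a minimal inference sketch using the Hugging Face transformers library. The checkpoint id is a placeholder assumption, not a confirmed name; substitute the actual published Apertus model id.

```python
# Minimal inference sketch with Hugging Face transformers.
# NOTE: the model id below is an illustrative assumption; replace it
# with the actual published Apertus checkpoint name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "swiss-ai/Apertus-8B-Instruct"  # hypothetical id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

inputs = tokenizer(
    "Translate to Romansh: Good morning!", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```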
Key Features and Contributions
- Multilingual Capabilities: Supports 44 to 45 languages, including low-resource languages, and provides comprehensive support for Romansh, Switzerland's fourth national language.
- Data Compliance: Ensures compliance with dataset licenses, retroactively respects crawling permissions, and filters out non-permissive and toxic content.
- Memorization Mitigation: Implements the Goldfish objective to suppress verbatim recall of training data, reducing risks associated with copyright and privacy violations (a loss sketch follows this list).
- Training Innovations: Utilizes advanced techniques such as the xIELU activation function, the AdEMAMix optimizer, QK-Norm (sketched after this list), and Pre-Norm to enhance model performance and training stability.
- Long Context Support: Capable of processing sequences of up to 65,536 tokens, enabling better handling of complex queries and tasks.
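As referenced in the list above, the Goldfish objective drops a pseudorandom subset of tokens from the training loss so that no training sequence can be reproduced verbatim. The PyTorch sketch below is a simplified illustration; the hash function and the hyperparameters `k` and `context` are assumptions for exposition, not the exact Apertus implementation.

```python
import torch
import torch.nn.functional as F

def goldfish_loss(logits, labels, k=4, context=13):
    """Goldfish-style masked cross-entropy (simplified sketch).

    `logits` (B, T, V) and `labels` (B, T) are assumed already aligned
    for next-token prediction. Roughly 1/k of positions are dropped
    from the loss via a deterministic hash of the preceding context.
    """
    losses = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
        reduction="none",
    ).view(labels.shape)

    # Deterministic pseudorandom mask: hash a sliding window of the
    # preceding `context` token ids; the same span always receives the
    # same mask, so repeated documents never train on masked tokens.
    mask = torch.ones_like(labels, dtype=torch.bool)
    for t in range(context, labels.size(1)):
        window = labels[:, t - context : t]
        h = window.long().sum(dim=1) * 2654435761  # toy multiplicative hash
        mask[:, t] = (h % k) != 0

    return (losses * mask).sum() / mask.sum().clamp(min=1)
```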
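QK-Norm, one of the stability techniques listed above, normalizes queries and keys before the attention logits are computed, bounding their magnitude and preventing attention-logit blow-ups. Below is a minimal sketch assuming PyTorch's built-in `nn.RMSNorm` (available in recent releases); all dimensions are illustrative, not Apertus's actual configuration.

```python
import torch.nn as nn

class QKNormAttention(nn.Module):
    """Self-attention with RMS-normalized queries and keys (sketch)."""

    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # One norm each for queries and keys, applied per head.
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        heads = (B, T, self.n_heads, self.head_dim)
        q = self.q_norm(q.reshape(heads)).transpose(1, 2)
        k = self.k_norm(k.reshape(heads)).transpose(1, 2)
        v = v.reshape(heads).transpose(1, 2)
        y = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, C))
```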
Problem Statement
Apertus addresses several shortcomings in existing LLMs:
- Lack of reproducible data pipelines and disregard for content-owner rights.
- Insufficient support for lower-resource languages and limited focus on multilinguality.
- Existing models may produce hallucinations and unsafe outputs, failing to align with user expectations regarding helpfulness and safety.
Training and Feedback Mechanisms
- Training Pipeline: The model undergoes a multi-stage training process, including pretraining, long-context extension phases, and supervised fine-tuning (SFT); a minimal SFT loss sketch follows this list.
- Feedback Types: Utilizes both absolute reward signals and relative preferences to optimize model performance.
- Evaluation: Extensive evaluation across multilingual and long-context benchmarks (e.g., RULER), as well as task-specific suites for coding and mathematical reasoning.
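As a concrete reference for the SFT stage mentioned above, the sketch below shows the standard masked next-token cross-entropy objective. The `-100` prompt-masking convention follows common Hugging Face practice; it is an assumption for illustration, not a quotation of the Apertus recipe.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Next-token cross-entropy over response tokens only (sketch).

    `labels` is a copy of `input_ids` with prompt positions set to -100,
    so the loss is computed on the assistant response alone.
    """
    logits = model(input_ids).logits
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```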
Relationship to Other Models
Apertus builds upon and improves methods from several foundational models, including:
- GPT-J and NVIDIA's Megatron-LM: Leveraging their architectures and optimization strategies.
- Direct Preference Optimization (DPO) and reinforcement learning with KL regularization: Addressing limitations of these methods through QRPO, which optimizes against absolute reward signals rather than pairwise preferences (see the sketch below).
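To make the contrast concrete, here is a minimal sketch of the standard DPO loss (Rafailov et al., 2023), which optimizes relative preferences between a chosen and a rejected response. QRPO instead targets absolute rewards; its exact objective is not reproduced here.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over summed sequence log-probabilities."""
    # Log-ratio of policy to frozen reference for each response.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```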
Evaluation and Performance
Apertus models have shown strong performance across various benchmarks:
- General Language Understanding: Achieves state-of-the-art results in multilingual benchmarks.
- Cultural Knowledge: Leads among fully open models in cultural knowledge assessments.
- Robustness: Memorization remains at baseline levels while outputs retain high lexical diversity, demonstrating stability throughout training.
Benchmark Results
- Apertus-70B-Instruct:
  - HumanEval (Pass@10): 73.0
  - GSM8K: 77.6
- Apertus-8B:
  - General Language Understanding: 65.8
  - HellaSwag: 70.6
Limitations and Open Questions
Despite its strengths, Apertus has limitations:
- Models may still produce hallucinations and toxic outputs.
- Current implementations are focused on language-only tasks and do not handle multimodal inputs.
- Further RL training is needed to enhance alignment and safety measures.
Conclusion
Apertus represents a significant advancement in the development of open, multilingual LLMs, addressing critical gaps in data compliance, multilingual representation, and safety. Its innovative training techniques and robust evaluation results position it as a leading model in the field of AI language processing.