AI Model Documentation: GRPO, DPO, Online DPO

Overview

This documentation covers three related training methods: GRPO (Group Relative Policy Optimization), DPO (Direct Preference Optimization), and Online DPO. Together they bridge the gap between offline and online reinforcement learning for large language models (LLMs), addressing challenges in instruction following, reasoning across a broad range of tasks, and performance optimization while limiting reliance on separate reward models.

Architecture

These methods build on principles from Proximal Policy Optimization (PPO) and asynchronous reinforcement learning. GRPO is an on-policy algorithm that requires synchronous updates, while DPO is designed for offline training on pre-generated responses; online and semi-online DPO regenerate responses during training to close that gap. The training setup uses Athene-RM-8B as the reward model for preference ranking, and GRPO samples a group of responses per prompt to approximate relative advantages.
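
As a rough illustration of the group-relative advantage idea, the sketch below normalizes scalar rewards within each group of responses sampled for the same prompt. The function name, tensor shapes, and epsilon value are illustrative assumptions, not the paper's reference implementation.

  import torch

  def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
      # rewards: (num_prompts, group_size), one scalar reward per sampled response.
      # Each response's advantage is its reward minus the group mean,
      # scaled by the group standard deviation (GRPO-style normalization).
      mean = rewards.mean(dim=-1, keepdim=True)
      std = rewards.std(dim=-1, keepdim=True)
      return (rewards - mean) / (std + eps)

  # Example: 2 prompts, 4 sampled responses each.
  rewards = torch.tensor([[0.1, 0.7, 0.3, 0.9],
                          [1.0, 0.0, 1.0, 0.0]])
  print(group_relative_advantages(rewards))

Because advantages are computed relative to the group, no separate value network is needed to estimate a baseline.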

Goals

The primary goals of the models include:

  • Optimizing for human preferences or verifiable rewards.
  • Improving reasoning across a wide range of tasks beyond specific domains.
  • Enhancing performance in instruction following tasks.
  • Mitigating response length bias in reward models.
  • Eliminating the need for a reward model by directly optimizing preferred outputs using pairwise comparisons.
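
To make the last goal concrete, the following sketch shows the standard DPO objective, which scores preference pairs using log-probability ratios against a frozen reference model and needs no explicit reward model; argument names and the beta value are placeholders.

  import torch
  import torch.nn.functional as F

  def dpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps, beta=0.1):
      # Each argument is a tensor of summed log-probabilities of the
      # chosen / rejected response under the policy or reference model.
      policy_logratio = policy_chosen_logps - policy_rejected_logps
      ref_logratio = ref_chosen_logps - ref_rejected_logps
      # -log sigmoid(beta * (policy log-ratio minus reference log-ratio))
      return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

  # Example with dummy log-probabilities for a batch of two pairs
  print(dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-11.0, -11.5]),
                 torch.tensor([-10.5, -12.0]), torch.tensor([-10.5, -12.0])))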

Dataset Information

The models utilize two primary datasets:

  • WildChat-1M: Used for non-verifiable tasks, focusing on user interactions.
  • NuminaMath: Used for verifiable tasks, providing math problems with reference answers that can be checked automatically.

For DPO-style training, preference pairs are formed from the highest- and lowest-scored responses as ranked by the Athene-RM-8B reward model.
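
A minimal sketch of this pairing step, assuming each prompt already has a list of sampled responses with scalar reward-model scores (the helper name and tie-handling rule are assumptions):

  def build_preference_pair(responses, scores):
      # Pick the highest- and lowest-scored responses for one prompt.
      # Returns None when all scores tie, since no preference can be formed.
      if max(scores) == min(scores):
          return None
      chosen = responses[scores.index(max(scores))]
      rejected = responses[scores.index(min(scores))]
      return chosen, rejected

  # Example with dummy reward-model scores
  print(build_preference_pair(["resp_a", "resp_b", "resp_c"], [0.2, 0.9, -0.1]))
  # ('resp_b', 'resp_c')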

Outputs

The models' outputs are scored with two kinds of rewards during training:

  • Scalar rewards for non-verifiable tasks.
  • Binary rewards for verifiable tasks.
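
A simplified sketch of how these two reward types might be dispatched during training; the exact-match check and argument names are simplifying assumptions, not the paper's verification procedure.

  def compute_reward(task_type, response, reference_answer=None, reward_model_score=None):
      # Verifiable tasks (e.g. math) get a binary reward from an answer check;
      # non-verifiable tasks use the scalar score from a reward model.
      if task_type == "verifiable":
          return 1.0 if response.strip() == str(reference_answer).strip() else 0.0
      return float(reward_model_score)

  print(compute_reward("verifiable", " 42 ", reference_answer=42))        # 1.0
  print(compute_reward("non_verifiable", "hi", reward_model_score=0.73))  # 0.73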

The expected outcomes include improved performance on both verifiable and non-verifiable tasks, with significant gains observed in benchmarks such as AlpacaEval and Arena-Hard.

Relationship to Other Methods

The models build on existing methods like PPO and DPO, addressing their limitations:

  • Online and semi-online DPO objectives outperform traditional offline methods.
  • DPO is less noisy than PPO or GRPO.
  • Semi-online DPO, online DPO, and GRPO show comparable performance, significantly outperforming seed models.

Techniques and Modules

Several techniques are integrated into the models to enhance their performance:

  • Direct Preference Optimization (DPO): Fine-tunes LLMs using preference labels, reducing reliance on human annotators.
  • Group Relative Policy Optimization (GRPO): Fine-tunes LLMs in an online manner, improving the training of reasoning LLMs.
  • Combination of Rewards: Integrates different reward types into a single training run, addressing reward hacking and performance issues.
  • Entropy Regularization: Mitigates entropy collapse in DPO by adjusting the training objective.
  • Length Penalty: Reduces length bias in reward models by incorporating a penalty in the loss function.
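
One simple way such a length penalty can be folded into the reward is sketched below; the coefficient and normalization constant are illustrative assumptions, not the paper's settings.

  def length_penalized_reward(raw_reward, response_length, max_length=2048, alpha=0.1):
      # Subtract a penalty proportional to the normalized response length,
      # discouraging the policy from exploiting length bias in the reward model.
      return raw_reward - alpha * (response_length / max_length)

  print(length_penalized_reward(0.8, response_length=1024))  # 0.75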

Evaluation

The models are evaluated using various benchmarks:

  • AlpacaEval 2.0 and Arena-Hard assess performance on non-verifiable instruction-following tasks, while verifiable math tasks are evaluated separately.
  • Metrics include win rates, accuracy, and scores with confidence intervals.
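
As a generic illustration of reporting a win rate with a confidence interval (not the official AlpacaEval or Arena-Hard scoring pipeline), a bootstrap estimate can be computed as follows:

  import random

  def win_rate_with_ci(outcomes, n_boot=2000, seed=0):
      # outcomes: list of 1 (win) / 0 (loss) judgments against a baseline.
      # Returns the point estimate and a 95% bootstrap confidence interval.
      rng = random.Random(seed)
      point = sum(outcomes) / len(outcomes)
      boots = sorted(
          sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
          for _ in range(n_boot)
      )
      return point, (boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)])

  print(win_rate_with_ci([1, 0, 1, 1, 0, 1, 1, 0]))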

Headline Results

  • Semi-online DPO, online DPO, and GRPO show significant improvements over the seed model.
  • Online DPO reaches a 56.6% win rate on AlpacaEval and a 45.6 score on Arena-Hard, well ahead of offline DPO.

Limitations and Open Questions

Despite their advancements, the models face challenges:

  • The theoretical understanding of off-policy updates in GRPO remains underexplored.
  • The optimal methods for combining different reward types are still being investigated.

Conclusion

GRPO, DPO, and Online DPO represent significant advances in bridging offline and online reinforcement learning for LLMs, addressing key challenges in instruction following and performance optimization across a variety of tasks.

Sources

https://arxiv.org/abs/2506.21495v1