Self-Rewarding PPO (SRPPO) Model Documentation
Overview
Self-Rewarding PPO (SRPPO) is a novel fine-tuning method designed to enhance the performance and generalization of large language models (LLMs) by addressing common challenges in supervised fine-tuning (SFT) and reinforcement learning (RL) fine-tuning. SRPPO bridges the gap between these two approaches, enabling robust alignment from demonstration data without the need for extensive human preference annotations.
Architecture
SRPPO employs a two-step fine-tuning process that combines supervised fine-tuning with Proximal Policy Optimization (PPO). The architecture primarily utilizes a coherent reward mechanism derived from the SFT policy, allowing for on-policy training. The key components include:
- SFT Policy: The model is initially fine-tuned on demonstration data to obtain the SFT policy.
- Coherent Reward: A novel reward function that quantifies the divergence between the pretrained model and the SFT policy, guiding the fine-tuning process.
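Based on the log-policy-ratio description in the Techniques and Modules section below, the coherent reward for a prompt x and response y can be sketched as follows; any additional scaling or normalization factor is an assumption not fixed by this document:

$$ r_{\text{coherent}}(x, y) \;=\; \log \frac{\pi_{\text{SFT}}(y \mid x)}{\pi_{\text{base}}(y \mid x)} \;=\; \log \pi_{\text{SFT}}(y \mid x) \;-\; \log \pi_{\text{base}}(y \mid x) $$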
Goals
The primary goals of SRPPO are to:
- Improve model performance and robustness without requiring additional human annotations.
- Enhance out-of-domain generalization and reduce overfitting associated with traditional SFT methods.
- Enable effective alignment of language models using demonstration data alone.
Dataset Information
SRPPO requires demonstration data in the following forms for training:
- Pairs of prompts and desired responses.
- Demonstration datasets such as TULU-v2-mix and UltraFeedback.
Both stages of SRPPO rely on these demonstrations: the SFT policy is fine-tuned on them directly, and the coherent reward is then derived from that SFT policy.
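A minimal illustration of a single demonstration record is shown below, assuming a prompt/response pair format; the field names are illustrative and not prescribed by SRPPO or by the datasets listed above:

```python
# One demonstration example: a prompt paired with a desired response.
# Field names ("prompt", "response") are illustrative assumptions.
demo_example = {
    "prompt": "Explain why the sky appears blue.",
    "response": "Sunlight is scattered by air molecules, and shorter (blue) "
                "wavelengths scatter more strongly, so the sky looks blue.",
}

# A demonstration dataset is then simply a collection of such records.
demo_dataset = [demo_example]
```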
Outputs
The expected outputs of the SRPPO model include:
- Enhanced alignment and generalization across various tasks, such as instruction following, math reasoning, and conversational abilities.
- Improved performance metrics on benchmarks like IFEval, GSM8k, and AlpacaEval, demonstrating the model's effectiveness across different evaluation settings.
Techniques and Modules
Several key techniques and modules are integral to the functioning of SRPPO:
- Self-Rewarding Mechanism:
  - Purpose: Enable on-policy fine-tuning without human preference annotations.
  - Function: Uses the log policy ratio between the SFT model and the pretrained base model as an implicit reward signal (see the sketch after this list).
  - Expected Effect: Improves generalization, data efficiency, and robustness.
- Coherent Reward:
  - Purpose: Provide the reward signal for fine-tuning by quantifying the divergence between the pretrained policy and the SFT policy.
  - Function: Derived from the SFT policy, guiding the model's behavior towards alignment.
  - Expected Effect: Enhances alignment and generalization across prompts from a limited amount of demonstration data.
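The following is a minimal sketch of how the log-policy-ratio reward could be computed, assuming both policies expose next-token logits over the sampled response tokens; the tensor shapes, the per-sequence sum, and the function name are assumptions, not a reference implementation:

```python
import torch

def coherent_reward(sft_logits: torch.Tensor,
                    base_logits: torch.Tensor,
                    response_ids: torch.Tensor) -> torch.Tensor:
    """Sketch of the coherent reward: the log policy ratio between the
    SFT policy and the pretrained base policy on a sampled response.

    sft_logits, base_logits: [batch, seq_len, vocab] logits for the response tokens
    response_ids:            [batch, seq_len] sampled response token ids
    """
    sft_logp = torch.log_softmax(sft_logits, dim=-1)
    base_logp = torch.log_softmax(base_logits, dim=-1)
    # Per-token log-probabilities of the tokens that were actually sampled.
    tok = response_ids.unsqueeze(-1)
    sft_tok_logp = sft_logp.gather(-1, tok).squeeze(-1)
    base_tok_logp = base_logp.gather(-1, tok).squeeze(-1)
    # Reward = log pi_SFT(y|x) - log pi_base(y|x), summed over the response.
    return (sft_tok_logp - base_tok_logp).sum(dim=-1)
```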
Algorithm
The SRPPO algorithm follows a structured two-step approach:
1. Fine-tune a pretrained base model on demonstration data to obtain the SFT policy.
2. Perform on-policy RL fine-tuning using the coherent reward derived from the SFT policy.
This two-step process ensures that the model effectively learns from the demonstration data while maintaining alignment with the desired outcomes.
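A high-level sketch of this two-step procedure is given below. The helper callables (supervised_finetune, sample_responses, ppo_update) and the model methods clone and logprob are hypothetical placeholders used for illustration only, not part of a published SRPPO implementation:

```python
from typing import Callable, List

def srppo(
    base_model,
    demonstrations,
    prompts: List[str],
    num_ppo_steps: int,
    supervised_finetune: Callable,  # hypothetical: SFT on demonstration data
    sample_responses: Callable,     # hypothetical: roll out the current policy
    ppo_update: Callable,           # hypothetical: one PPO optimization step
):
    """High-level sketch of SRPPO's two-step procedure (illustrative only)."""
    # Step 1: supervised fine-tuning on demonstration data -> SFT policy.
    sft_policy = supervised_finetune(base_model, demonstrations)

    # Step 2: on-policy PPO fine-tuning with the coherent reward.
    policy = sft_policy.clone()  # the policy being optimized
    for _ in range(num_ppo_steps):
        rollouts = sample_responses(policy, prompts)  # (prompt, response) pairs
        # Coherent reward: log pi_SFT(y | x) - log pi_base(y | x).
        rewards = [
            sft_policy.logprob(x, y) - base_model.logprob(x, y)
            for x, y in rollouts
        ]
        policy = ppo_update(policy, rollouts, rewards)
    return policy
```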
Evaluation
The model's performance is evaluated using various benchmarks and datasets, including:
- Benchmarks: IFEval, GSM8k, GPQA, and AlpacaEval.
- Metrics: Various accuracy metrics and win rates across different evaluation settings.
SRPPO has been shown to achieve the best overall average scores across these benchmarks, demonstrating its effectiveness in enhancing model performance.
Limitations and Open Questions
While SRPPO addresses many challenges associated with traditional SFT methods, some limitations and open questions remain:
- Prolonged SFT may lead to overfitting, negatively impacting out-of-domain generalization, particularly in tasks like math reasoning.
Practicalities
Key hyperparameters for training SRPPO include:
- SFT batch size: 128
- Training epochs: 2
- Learning rates: {1 × 10^-5, 5 × 10^-6, 1 × 10^-6, 5 × 10^-7}
- PPO rollout buffer size: 1024
- KL coefficient: 0.2 or 0.5
Training the model requires substantial computational resources, such as NVIDIA A100 GPUs.
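For convenience, the hyperparameters above can be collected into a single configuration. This is a minimal sketch in which the key names and the single chosen KL coefficient are illustrative assumptions:

```python
# Illustrative SRPPO hyperparameter configuration (key names are assumptions).
srppo_config = {
    "sft_batch_size": 128,
    "training_epochs": 2,
    # Learning rates reported as a sweep; pick one value per run.
    "learning_rates": [1e-5, 5e-6, 1e-6, 5e-7],
    "ppo_rollout_buffer_size": 1024,
    "kl_coefficient": 0.2,  # alternative: 0.5
}
```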
Conclusion
Self-Rewarding PPO (SRPPO) represents a significant advancement in fine-tuning language models, providing a scalable and effective approach to alignment using demonstration data. By addressing the limitations of existing methods and enhancing generalization capabilities, SRPPO sets a new standard for model performance in various applications.