AI Model Documentation
Overview
This document provides a comprehensive overview of an AI model designed to enhance preference fine-tuning (PFT) techniques in reinforcement learning (RL). The model aims to address challenges in data efficiency and sample complexity, as well as the performance gap between online and offline training methods.
Architecture
The model employs a two-stage training approach:
1. Training a Reward Model (RM): a classifier is trained on preference data to distinguish preferred from dis-preferred completions.
2. Online Reinforcement Learning: the trained RM is used for policy optimization, maximizing the likelihood of preferred outputs.
The architecture integrates several key components (a reward-model training sketch follows this list):
- Policy Model: Maps from input prefixes to distributions over next tokens.
- Reward Model: Classifies outputs based on preference data.
- Value Function: Represents the expected return for a given state-action pair.
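For concreteness, below is a minimal sketch of how a reward model can be trained on preference pairs with a Bradley-Terry (logistic) loss. The class, tensor shapes, and optimizer settings are illustrative assumptions, not details taken from the source.

```python
# Minimal sketch of reward-model training on preference pairs (Bradley-Terry /
# logistic loss). All class and variable names here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy scalar reward head over a fixed-size prompt+completion embedding."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, dim) embedding of a prompt+completion pair
        return self.head(emb).squeeze(-1)  # (batch,) scalar rewards

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)

# Fake batch: embeddings of preferred (chosen) and dis-preferred (rejected) completions.
chosen_emb = torch.randn(8, 128)
rejected_emb = torch.randn(8, 128)

r_chosen = rm(chosen_emb)
r_rejected = rm(rejected_emb)

# Bradley-Terry / logistic loss: maximize log sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
opt.zero_grad()
loss.backward()
opt.step()
```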
Goals
The primary goals of the model include:
- Reducing data burden by searching over a reduced subset of policies.
- Achieving equivalence in quality between online and offline PFT techniques.
- Addressing finite-sample limitations in policy training.
- Understanding and mitigating the performance gap between online and offline PFT methods.
- Reducing sample complexity in fine-tuning by focusing on simpler reward models.
Dataset Info
The model requires preference data for training, which can be obtained through:
- Ranking completions generated by policies against reference summaries using the ROUGE-L metric (sketched after this list).
- Collecting feedback from a reward model trained on preference data.
- Utilizing datasets such as the SFT dataset and OpenAI's TL;DR summarization data.
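As a hedged illustration of the first option above, the sketch below builds a preference pair by ranking two sampled completions against a reference summary with a plain LCS-based ROUGE-L F-score. The scorer, field names, and data layout are illustrative, not taken from the source.

```python
# Sketch: build a preference pair by ranking two policy samples with ROUGE-L
# against a reference summary. The scorer is a plain LCS-based F-measure.

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure between a candidate and a reference string."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def make_preference_pair(prompt, completion_a, completion_b, reference):
    """Label the higher-ROUGE-L completion as preferred (chosen)."""
    score_a = rouge_l_f1(completion_a, reference)
    score_b = rouge_l_f1(completion_b, reference)
    chosen, rejected = (completion_a, completion_b) if score_a >= score_b else (completion_b, completion_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```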
Outputs
The model's outputs are evaluated based on:
- Winrate against human-generated references.
- ROUGE-L scores to assess the quality of generated summaries.
- Best-of-N (BoN) evaluations.
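A minimal sketch of Best-of-N selection and winrate computation follows. `sample_from_policy`, `reward_model`, `generate`, and `judge_prefers` are hypothetical stand-ins for components the document assumes but does not specify (the policy sampler, the learned reward model, and the preference judge).

```python
# Sketch: Best-of-N (BoN) selection with a learned reward model, plus a simple
# winrate computation against human-generated references.

def best_of_n(prompt, sample_from_policy, reward_model, n: int = 16) -> str:
    """Draw n completions and keep the one the reward model scores highest."""
    candidates = [sample_from_policy(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

def winrate(prompts, references, generate, judge_prefers) -> float:
    """Fraction of prompts on which the judge prefers the model output over the reference."""
    wins = sum(
        judge_prefers(p, generate(p), ref)  # True if the model output is preferred
        for p, ref in zip(prompts, references)
    )
    return wins / len(prompts)
```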
Relationship to Other Methods
The model builds upon existing methodologies, including:
- DPO (Rafailov et al., 2023)
- PPO (Schulman et al., 2017)
- RLHF (Christiano et al., 2017)
It demonstrates that online approaches can outperform offline methods in fine-tuning and that the generation-verification gap can be effectively addressed through simpler reward models.
Core Objects and Definitions
- Policy Model: A mapping from states to action distributions.
- Reward Model: A classifier trained on preference data.
- Reward Definition: A model relating rewards to the probability that one completion is preferred over another.
- Value Function Definition: Represents the optimal expected return attainable from a given state.
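For concreteness, these objects are commonly written as follows (standard notation, not quoted from the source): the Bradley-Terry model relates rewards to preference probabilities, and the value function is defined with respect to a KL-regularized objective toward a reference policy.

```latex
% Bradley-Terry model: the reward r relates to the probability that
% completion y_1 is preferred over y_2 given prompt x
P(y_1 \succ y_2 \mid x) \;=\; \sigma\big(r(x, y_1) - r(x, y_2)\big)
  \;=\; \frac{\exp r(x, y_1)}{\exp r(x, y_1) + \exp r(x, y_2)}

% KL-regularized fine-tuning objective with reference policy \pi_{ref}
% and regularization strength \beta; its optimum defines the value function
\max_{\pi} \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi(\cdot \mid x)\,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```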
Objectives and Losses
The model's primary objective is to maximize the generation likelihood of preferred completions. It employs various loss functions, including:
- DPO loss
- Reverse KL regularization
- Logistic loss
Regularization terms such as reverse KL and entropy penalties are also used to keep the policy's probabilities close to those of the reference policy.
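As a hedged sketch, the DPO loss listed above can be computed from sequence log-probabilities under the current policy and a frozen reference policy; the function signature and the beta value below are illustrative assumptions.

```python
# Sketch of the DPO loss (Rafailov et al., 2023) on a batch of preference pairs.
# Inputs are summed log-probabilities of each completion under the current
# policy and the frozen reference policy; beta is the KL-strength hyperparameter.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: beta * log(pi(y|x) / pi_ref(y|x)) for each completion.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin between preferred and dis-preferred completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```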
Algorithm
The model follows a structured training pipeline (a sketch of the Stage 2 update follows the list):
1. Stage 1: Train a reward model using maximum likelihood estimation (MLE).
2. Stage 2: Optimize the learned reward together with an entropy term through online reinforcement learning.
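The exact Stage 2 algorithm is not specified here; purely as an illustrative sketch, the update below uses a plain REINFORCE-style step with the learned reward and a KL penalty toward the reference policy, rather than any particular RL algorithm (such as PPO) the source may use. `sample_with_logprob`, `ref_logprob`, and `reward_model` are hypothetical placeholders.

```python
# Sketch of one Stage-2 update: sample completions online, score them with the
# frozen reward model, penalize divergence from the reference policy, and take
# a REINFORCE-style gradient step.
import torch

def stage2_step(policy, ref_logprob, reward_model, prompts, optimizer, beta: float = 0.05):
    losses = []
    for x in prompts:
        # Sample a completion from the current policy, keeping its log-probability
        # as a differentiable scalar tensor.
        y, logp = policy.sample_with_logprob(x)
        with torch.no_grad():
            r = reward_model(x, y)                        # frozen reward model score
            kl_pen = logp.detach() - ref_logprob(x, y)    # log pi(y|x) - log pi_ref(y|x)
        ret = r - beta * kl_pen                           # KL-regularized return (constant w.r.t. policy)
        losses.append(-ret * logp)                        # REINFORCE: raise likelihood of high-return samples
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```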
Techniques or Modules
Several techniques are integrated into the model:
- Reward Model (RM): Classifies outputs to simplify policy search.
- Entropy (Reverse-KL) Regularization: Keeps the policy's probabilities close to those of a reference policy.
- Online Fine-Tuning: Reduces sample complexity by restricting the search to policies consistent with a simpler learned reward model.
Theory
Key theoretical insights include:
- Theorems demonstrating the equivalence of RLHF and DPO under certain conditions.
- Lemmas establishing relationships between policy optimization and KL divergence.
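One standard result of this kind (stated in conventional notation, not quoted from the source) is that the KL-regularized objective admits a closed-form optimal policy; inverting it yields the DPO reparametrization of the reward, under which the RLHF and DPO objectives coincide for Bradley-Terry preferences.

```latex
% Closed-form optimum of the KL-regularized objective
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\big( r(x, y) / \beta \big),
\qquad
Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\big( r(x, y) / \beta \big)

% Inverting this relation gives the DPO reparametrization of the reward
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)

% Substituting into the Bradley-Terry likelihood (where Z(x) cancels) yields the DPO loss
\mathcal{L}_{\mathrm{DPO}}(\pi)
  \;=\; -\,\mathbb{E}_{(x, y^{+}, y^{-})}\left[ \log \sigma\Big(
      \beta \log \tfrac{\pi(y^{+} \mid x)}{\pi_{\mathrm{ref}}(y^{+} \mid x)}
    \;-\; \beta \log \tfrac{\pi(y^{-} \mid x)}{\pi_{\mathrm{ref}}(y^{-} \mid x)} \Big) \right]
```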
Practicalities
Training uses algorithm-specific hyperparameters; these settings govern behavior during both the training and evaluation phases.
Compute and Systems
Training and inference require substantial computational resources, including high-performance GPUs such as NVIDIA A100s, A800s, and H100s.
Evaluation
Evaluation uses metric-specific sampling temperatures; reported metrics include ROUGE-L and winrate against human-written references, and performance is benchmarked on established datasets.
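The specific temperatures are not reproduced here; purely as an illustration of temperature-scaled sampling, a minimal sketch follows (the default temperature value is arbitrary).

```python
# Sketch of temperature-scaled sampling from next-token logits. The document
# mentions metric-specific sampling temperatures but does not list them, so the
# value below is illustrative only.
import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 0.7, rng=None) -> int:
    """Sample a token index from softmax(logits / temperature)."""
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-8)
    scaled -= scaled.max()              # numerical stability before exponentiation
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```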
Limitations and Open Questions
A noted limitation is that the reasons behind the differing generalization behavior of global versus local reward models remain unclear; further research is needed to address this question.
This documentation serves as a foundational reference for understanding the AI model's architecture, goals, methodologies, and evaluation strategies.