Nash Learning from Human Feedback (NLHF)

Overview

Nash Learning from Human Feedback (NLHF) is an approach to fine-tuning large language models (LLMs) from pairwise human feedback. It addresses limitations of traditional Reinforcement Learning from Human Feedback (RLHF) and the Bradley-Terry reward model, which often fail to capture the full range of human preferences. By computing the Nash equilibrium of a general preference model, NLHF optimizes for preference feedback without requiring a reward function, aligning models more closely with human preferences.

Architecture

NLHF introduces several key components and algorithms that enhance its functionality:

  • Nash-MD: A mirror-descent-based algorithm that improves the policy by generating responses and comparing them against a geometric mixture of the current policy and a reference policy (see the sketch below).
  • Nash-EMA: An exponential-moving-average approach that approximates a mixture of past policies without storing them, reducing memory requirements.
  • Regularized Preference Model: Ensures accurate estimation of preferences while maintaining proximity to a known safe policy.

The architecture is designed to compute the Nash equilibrium of the preference model, framing policy optimization as a two-player game in which each policy's payoff is its preference probability against the other.
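As a concrete illustration of the mixture used by Nash-MD, the minimal sketch below forms the token-level geometric mixture π_mix ∝ π_current^(1−β) · π_ref^β by interpolating logits; the function name and NumPy setup are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def geometric_mixture(logits_current, logits_reference, beta):
    """Token-level geometric mixture pi_mix ∝ pi_current^(1-beta) * pi_ref^beta.

    Interpolating logits and renormalizing with a softmax is equivalent to
    interpolating log-probabilities up to an additive constant.
    """
    mixed = (1.0 - beta) * logits_current + beta * logits_reference
    mixed -= mixed.max()                      # numerical stability
    probs = np.exp(mixed)
    return probs / probs.sum()

# Toy 5-token vocabulary: the mixture interpolates between the two policies.
current = np.array([2.0, 0.5, -1.0, 0.0, 1.0])    # logits of the current policy
reference = np.zeros(5)                           # uniform reference policy
print(geometric_mixture(current, reference, beta=0.3))
```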

Goals

The primary objectives of NLHF include:

  • Calculating the Nash equilibrium of the preference model.
  • Optimizing the policy πθ using a regularized policy gradient algorithm.
  • Finding a policy π* that is preferred over any alternative policy.
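To make the last goal concrete, the hedged sketch below checks, over a finite set of candidate policies with an estimated pairwise preference matrix, whether some policy is weakly preferred over every alternative. The matrix entries are made-up numbers, and in general the Nash equilibrium is a mixture over policies rather than one of the candidates.

```python
import numpy as np

# P[i, j] = estimated probability that policy i's response beats policy j's.
# These values are illustrative only.
P = np.array([
    [0.50, 0.62, 0.55],
    [0.38, 0.50, 0.58],
    [0.45, 0.42, 0.50],
])

def preferred_over_all(P, i, threshold=0.5):
    """True if policy i is weakly preferred (>= threshold) to every alternative."""
    return bool(np.all(P[i] >= threshold))

for i in range(P.shape[0]):
    print(f"policy {i} preferred over all alternatives: {preferred_over_all(P, i)}")
```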

Dataset Info

NLHF is evaluated on the TL;DR dataset for text summarization. The method requires:

  • Pairwise preference data comparing candidate responses.
  • Feedback obtained through AI-generated comparisons and expert selections between proposed responses.

The analysis assumes that the context distribution ρ assigns positive probability to every context, so that preferences are defined across all contexts.
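A minimal sketch of how a single pairwise feedback record from a summarization dataset such as TL;DR might be represented; the field names are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class PairwisePreference:
    """One pairwise comparison: which of two candidate summaries is preferred."""
    context: str          # the post to summarize (x)
    response_a: str       # candidate summary y
    response_b: str       # candidate summary y'
    a_preferred: bool     # True if y was preferred over y' by the annotator/judge

example = PairwisePreference(
    context="A long forum post about planning a cross-country move...",
    response_a="Poster asks for advice on moving cross-country with two cats.",
    response_b="Someone is moving.",
    a_preferred=True,
)
```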

Outputs

The outputs of the NLHF model include:

  • A set of optimized policies that reflect human preferences.
  • Preference signals that indicate the likelihood of one response being favored over another given a specific context.

Relationship to Other Methods

NLHF builds on several foundational concepts and algorithms:

  • Mirror Descent: Provides a framework for policy optimization.
  • Fictitious Play: Offers insights into policy evaluation against a mixture of past policies.
  • Online Convex Optimization: Facilitates regret minimization in policy learning.

NLHF is most closely related to RLHF but drops the scalar reward model, optimizing policies directly against a preference model. Comparisons in the paper suggest that this better reflects the diversity of human preferences and is less sensitive to shifts in the preference distribution than RLHF.

Core Objects and Definitions

Key components of NLHF include:

  • Policy Model: Represents the current policy πθ and alternative policies.
  • Preference Signal: Computed as P(y ≻ y′ | x), indicating the probability that a randomly chosen human prefers response y over y′ given context x.

The policy-level preference P(π ≻ π′) is obtained by taking the expectation of P(y ≻ y′ | x) over contexts x ~ ρ and responses y ~ π(·|x), y′ ~ π′(·|x), which ties the two objects above into a single optimization framework.
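A hedged Monte Carlo sketch of that expectation: sample responses from each policy and average the pairwise preference. The sampler and preference_fn interfaces are placeholders, not the paper's implementation.

```python
import random

def estimate_policy_preference(contexts, sample_pi, sample_pi_prime, preference_fn, n=100):
    """Monte Carlo estimate of P(pi > pi').

    sample_pi(x), sample_pi_prime(x): draw a response for context x from each policy.
    preference_fn(x, y, y_prime): returns P(y > y' | x) in [0, 1].
    """
    total = 0.0
    for _ in range(n):
        x = random.choice(contexts)           # stand-in for sampling x ~ rho
        y = sample_pi(x)
        y_prime = sample_pi_prime(x)
        total += preference_fn(x, y, y_prime)
    return total / n

# Toy usage with stand-in samplers and a length-based "preference" for illustration.
pref = estimate_policy_preference(
    contexts=["post 1", "post 2"],
    sample_pi=lambda x: x + ": short summary",
    sample_pi_prime=lambda x: x + ": a much longer and more rambling summary",
    preference_fn=lambda x, y, y_prime: 1.0 if len(y) < len(y_prime) else 0.0,
)
print(pref)
```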

Objectives and Losses

The primary objective is to optimize the policy πθ by maximizing its regularized preference against the current opponent policy, which corresponds to minimizing at each step the loss:

  • ℓ_t(π) = −P_τ(π ≻ π_t)

Additionally, KL-regularization is used to maintain proximity to a reference policy throughout the optimization process.
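One way to write the KL-regularized preference, following the source paper, is P_τ(π ≻ π′) = P(π ≻ π′) − τ·KL(π, μ) + τ·KL(π′, μ), where μ is the reference policy. The scalar sketch below is illustrative; in practice these quantities would be estimated from samples.

```python
def regularized_preference(pref, kl_pi_to_mu, kl_pi_prime_to_mu, tau):
    """P_tau(pi > pi') = P(pi > pi') - tau*KL(pi, mu) + tau*KL(pi', mu).

    pref: estimated P(pi > pi') in [0, 1]
    kl_*: estimated KL divergences of each policy from the reference policy mu
    tau: regularization strength
    """
    return pref - tau * kl_pi_to_mu + tau * kl_pi_prime_to_mu

def step_loss(pref_tau):
    """Per-step loss l_t(pi) = -P_tau(pi > pi_t); minimizing it maximizes preference."""
    return -pref_tau

# Illustrative numbers only.
print(step_loss(regularized_preference(0.58, kl_pi_to_mu=0.12, kl_pi_prime_to_mu=0.05, tau=0.1)))
```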

Algorithm

NLHF first learns a preference model from pairwise data and then uses deep reinforcement-learning (policy-gradient) algorithms to approximate the Nash equilibrium of that model. Training optimizes policies from preference signals directly rather than from a learned reward function.
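The self-contained toy sketch below shows the overall shape of such a loop under simplifying assumptions (a categorical "policy" over four canned responses and a fixed, made-up preference table): sample a response from the current policy and an opponent response from a geometric mixture with the reference policy, score the pair with the preference function, and take a KL-regularized policy-gradient step. It illustrates the idea, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up preference table: PREF[i, j] = P(response i beats response j).
PREF = np.array([
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.5, 0.6, 0.7],
    [0.3, 0.4, 0.5, 0.6],
    [0.2, 0.3, 0.4, 0.5],
])

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros(4)          # logits of the current policy pi_theta
mu = np.full(4, 0.25)        # uniform reference policy
beta, tau, lr = 0.3, 0.05, 0.5

for _ in range(2000):
    pi = softmax(theta)
    # Opponent: geometric mixture of the current policy and the reference policy.
    opponent = softmax((1 - beta) * np.log(pi) + beta * np.log(mu))
    y = rng.choice(4, p=pi)
    y_prime = rng.choice(4, p=opponent)
    # Preference signal for the sampled pair, KL-regularized toward mu.
    signal = PREF[y, y_prime] - tau * (np.log(pi[y]) - np.log(mu[y]))
    # REINFORCE-style update: gradient of log pi_theta(y) times the centered signal.
    grad_log_pi = -pi
    grad_log_pi[y] += 1.0
    theta += lr * (signal - 0.5) * grad_log_pi

print(np.round(softmax(theta), 3))   # most mass should end up on response 0
```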

Techniques or Modules

Several techniques underpin the NLHF framework:

  • Nash-MD: Converges (in its last iterate) to the regularized Nash equilibrium without retaining intermediate policies.
  • Nash-EMA: Provides a memory-efficient approximation of the mixture of past policies via an exponential moving average (see the sketch after this list).
  • Regularized Preference Model: Incorporates KL-regularization to keep preference estimates anchored to a reference policy.
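A minimal sketch of the exponential-moving-average idea at the parameter level, assuming parameters are stored as named arrays; the decay value and setup are illustrative.

```python
import numpy as np

def ema_update(ema_params, current_params, alpha=0.99):
    """Exponential moving average of policy parameters.

    Keeps a single running average instead of storing every past policy:
    ema <- alpha * ema + (1 - alpha) * current.
    """
    return {name: alpha * ema_params[name] + (1 - alpha) * current_params[name]
            for name in ema_params}

# Toy parameters for two layers; in practice these would be model weight tensors.
current = {"w": np.ones(3), "b": np.zeros(3)}
ema = {"w": np.zeros(3), "b": np.zeros(3)}
for _ in range(10):
    ema = ema_update(ema, current)
print(ema["w"])   # drifts toward the current parameters over time
```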

Evaluation

NLHF is evaluated on text summarization via pairwise comparisons judged by the PaLM 2 Large LLM, in which NLHF-trained policies outperform the compared baselines.
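A hedged sketch of how a pairwise win rate against a baseline might be tallied from judge decisions; the judge interface is a placeholder stand-in, not the actual PaLM 2 API.

```python
def pairwise_win_rate(pairs, judge_prefers_first):
    """Fraction of comparisons in which the first model's output is preferred.

    pairs: list of (context, output_model_a, output_model_b)
    judge_prefers_first(context, a, b): True if the judge prefers a over b.
    """
    wins = sum(judge_prefers_first(x, a, b) for x, a, b in pairs)
    return wins / len(pairs)

# Toy usage with a length-based stand-in judge.
pairs = [("post", "concise summary", "a very long rambling summary"),
         ("post", "another concise one", "short")]
print(pairwise_win_rate(pairs, lambda x, a, b: len(a) < len(b)))
```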

Limitations and Open Questions

One notable limitation is that fair comparisons between NLHF and RLHF are difficult because the two pipelines rely on different underlying models; further work is needed to address this and to strengthen the method's robustness.

In summary, NLHF represents a significant advancement in aligning LLMs with human preferences through innovative methodologies that prioritize direct preference optimization and the computation of Nash equilibria.

Sources

https://arxiv.org/abs/2312.00886v4