Q-SFT (Q-learning via Supervised Fine-tuning)

Overview

Q-SFT, or Q-learning via Supervised Fine-tuning, is an offline reinforcement learning (RL) algorithm for training large language models (LLMs) and vision-language models (VLMs) on multi-turn tasks. By addressing the challenges of scaling value-based RL methods, Q-SFT provides a robust framework for learning Q-values in complex scenarios such as dialogue systems and robotic control.

Architecture

Q-SFT uses a modified supervised fine-tuning objective to learn the Q-function without requiring significant changes to existing model architectures. The algorithm employs a weighted cross-entropy loss that conservatively estimates the value function, yielding more stable training than traditional regression-based Q-learning. The method builds on value-based RL principles, particularly Q-learning, and incorporates ideas from return-conditioned supervised learning (RCSL) and behavioral cloning.
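
A minimal sketch of such a weighted cross-entropy objective in PyTorch; the function name, tensor shapes, and the way the weights are supplied are illustrative assumptions rather than the paper's exact implementation:

    import torch
    import torch.nn.functional as F

    def weighted_ce_loss(logits, action_tokens, weights):
        # logits:        (batch, vocab) model outputs at the action position
        # action_tokens: (batch,) token ids of the actions taken in the dataset
        # weights:       (batch,) non-negative per-example weights, e.g. derived
        #                from a Bellman-style target (an assumption; see the
        #                Techniques section)
        log_probs = F.log_softmax(logits, dim=-1)
        chosen = log_probs.gather(-1, action_tokens.unsqueeze(-1)).squeeze(-1)
        # Down-weighting low-value actions makes the resulting probabilities a
        # conservative estimate of the value function.
        return -(weights * chosen).mean()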

Goals

The primary goals of Q-SFT include:

  • Bridging the gap between the capabilities required for single-turn and multi-turn tasks.
  • Learning Q-values effectively for multi-turn RL problems.
  • Improving the stability and performance of value-based RL methods in training LLMs and VLMs.
  • Retaining knowledge acquired during pretraining better than existing value-based RL methods do.
  • Achieving scalable offline reinforcement learning through autoregressive Q-functions.

Dataset Info

Q-SFT requires datasets in the form of trajectories from various tasks, including:

  • Chess (625K trajectories)
  • Wordle (20K trajectories)
  • Twenty Questions (100K conversations)
  • WebShop (12K initial user instructions)
  • ALFWorld (10K trajectories)
  • Randomized, scripted policies (300K trajectories)

The algorithm assumes bounded total rewards and is designed to handle distribution shifts typically encountered in offline RL settings.
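
One possible way to represent such trajectory data is sketched below; the class and field names are illustrative assumptions, and the Twenty Questions-style turns are made up for the example:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Turn:
        observation: str   # environment or dialogue state shown to the agent
        action: str        # the agent's utterance or command
        reward: float      # per-turn reward; total reward is assumed bounded

    @dataclass
    class Trajectory:
        turns: List[Turn] = field(default_factory=list)

    # Toy example in the spirit of Twenty Questions (made-up content).
    traj = Trajectory(turns=[
        Turn(observation="Oracle: I am thinking of an object.", action="Is it an animal?", reward=0.0),
        Turn(observation="Oracle: No.", action="Is it a tool?", reward=0.0),
        Turn(observation="Oracle: Yes.", action="Is it a hammer?", reward=1.0),
    ])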

Outputs

The outputs of Q-SFT include:

  • Q-values learned as probabilities, which provide a conservative estimate of the value function.
  • A policy derived from these probabilities that aims to maximize cumulative reward (a selection sketch follows this list).
  • Performance metrics such as success rates and average scores across held-out instructions.
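
A hedged sketch of how an action might be selected from the learned probabilities at inference time, assuming a Hugging Face-style causal language model and tokenizer and a fixed set of candidate actions; the paper's actual policy extraction may differ:

    import torch

    @torch.no_grad()
    def select_action(model, tokenizer, prompt, candidate_actions):
        # Score each candidate by the summed log-probability its tokens receive
        # under the fine-tuned model, treating that score as a proxy Q-value.
        scores = []
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        for action in candidate_actions:
            action_ids = tokenizer(action, add_special_tokens=False,
                                   return_tensors="pt").input_ids
            ids = torch.cat([prompt_ids, action_ids], dim=1)
            log_probs = torch.log_softmax(model(ids).logits, dim=-1)
            # Log-probabilities of the action tokens, conditioned on the prompt.
            token_lp = log_probs[0, prompt_ids.shape[1] - 1 : -1].gather(
                -1, action_ids[0].unsqueeze(-1)).squeeze(-1)
            scores.append(token_lp.sum().item())
        return candidate_actions[int(torch.tensor(scores).argmax())]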

Relationship to Other Methods

Q-SFT builds on, and is compared against, existing methods in reinforcement learning and related areas, including:

  • Value-based RL (specifically Q-learning)
  • Return-conditioned supervised learning (RCSL)
  • Behavioral cloning
  • State-of-the-art baselines such as ReAct, ILQL, and CQL

Q-SFT aims to outperform these methods by providing a more stable and effective approach to fine-tuning LLMs for multi-turn tasks.

Techniques or Modules

Key techniques employed in Q-SFT include:

  • Weighted Cross-Entropy Loss: Conservatively estimates the value function while avoiding the instability of regression-based objectives.
  • Q-learning: Optimizes an agent interacting with a Markov Decision Process (MDP) to learn effective policies for multi-step tasks (a Bellman-style backup is sketched after this list).
  • Behavioral Cloning: Trains a policy directly from demonstration data and scales well to complex tasks.
  • Offline RL: Learns from fixed datasets of trajectories, improving performance on sequential decision-making without online interaction.
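
A minimal sketch of how Bellman-style targets could supply the weights for the loss sketched under Architecture, assuming a standard fitted-Q backup with a frozen target model; the paper's exact weighting scheme may differ:

    import torch

    @torch.no_grad()
    def bellman_weights(next_step_logits, rewards, dones, gamma=0.99):
        # next_step_logits: (batch, vocab) frozen target-model logits at the
        #                   next decision point
        # rewards:          (batch,) per-step rewards
        # dones:            (batch,) 1.0 where the trajectory terminates
        # Bootstrap with the highest next-step probability as a stand-in for
        # max_a' Q(s', a'), since Q-values are represented as probabilities.
        next_value = torch.softmax(next_step_logits, dim=-1).max(dim=-1).values
        target = rewards + gamma * (1.0 - dones) * next_value
        # Clamp so the cross-entropy weights stay non-negative.
        return target.clamp(min=0.0)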

Evaluation

Q-SFT has undergone comprehensive empirical evaluation across various tasks, demonstrating competitive performance against state-of-the-art methods. Key findings include:

  • Q-SFT outperforms existing methods by significant margins, achieving nearly 30% better performance in some evaluations.
  • The method excels in low-data regimes, showcasing its robustness and efficiency.

Limitations and Open Questions

While Q-SFT presents a significant advancement in offline reinforcement learning, potential limitations include:

  • The method requires adaptation before it can be used for online RL optimization.
  • Q-learning-based approaches may still be outperformed by purely supervised methods in certain settings.

In summary, Q-SFT represents a significant step forward in the application of reinforcement learning techniques to large language models, addressing critical challenges in multi-turn task performance while maintaining stability and efficiency in training.

Sources

https://arxiv.org/abs/2411.05193v2