Q-SFT (Q-learning via Supervised Fine-tuning)
Overview
Q-SFT, or Q-learning via Supervised Fine-tuning, is an offline reinforcement learning (RL) algorithm for training large language models (LLMs) and vision-language models (VLMs) on multi-turn tasks. By recasting value learning as a supervised fine-tuning problem, it addresses the difficulty of scaling value-based RL methods to large pretrained models and provides a framework for learning Q-values in complex settings such as dialogue systems and robotic control.
Architecture
Q-SFT reuses the standard supervised fine-tuning setup to learn a Q-function, requiring no significant changes to existing model architectures. Instead of regressing to Bellman targets, it trains with a weighted cross-entropy loss whose learned token probabilities serve as a conservative estimate of the value function, yielding more stable training than traditional Q-learning. The method is grounded in value-based RL, particularly Q-learning, and draws on ideas from return-conditioned supervised learning (RCSL) and behavioral cloning.
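To make the objective concrete, below is a minimal PyTorch sketch of a value-weighted cross-entropy loss of the kind described above. The weighting scheme and the helper signature are illustrative assumptions, not the exact loss from the Q-SFT paper; in particular, the per-example weights are assumed to come from a Bellman-style target computed elsewhere.

```python
import torch
import torch.nn.functional as F

def weighted_ce_loss(logits: torch.Tensor,
                     action_tokens: torch.Tensor,
                     weights: torch.Tensor) -> torch.Tensor:
    """Value-weighted cross-entropy over action tokens (illustrative sketch).

    logits:        (batch, vocab) unnormalized scores from the LM head
    action_tokens: (batch,) indices of the action tokens taken in the dataset
    weights:       (batch,) weights in [0, 1], assumed to come from a
                   Bellman-style target such as r + gamma * V(s')
    """
    log_probs = F.log_softmax(logits, dim=-1)
    chosen_logp = log_probs.gather(1, action_tokens.unsqueeze(1)).squeeze(1)
    # Dataset actions are up-weighted in proportion to their estimated value,
    # so the learned token probabilities can act as conservative value estimates.
    return -(weights * chosen_logp).mean()
```

Because this objective stays in log-probability space, it can in principle be dropped into an existing supervised fine-tuning loop with minimal changes.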
Goals
The primary goals of Q-SFT include:
- Bridging the gap between the capabilities standard single-step fine-tuning produces and those required by multi-turn tasks.
- Learning Q-values effectively for multi-turn RL problems.
- Improving the stability and performance of value-based RL methods in training LLMs and VLMs.
- Retaining knowledge acquired during pretraining better than existing value-based RL methods.
- Achieving scalable offline reinforcement learning through autoregressive Q-functions.
Dataset Info
Q-SFT requires datasets in the form of trajectories from various tasks, including:
- Chess (625K trajectories)
- Wordle (20K trajectories)
- Twenty Questions (100K conversations)
- WebShop (12K initial user instructions)
- ALFWorld (10K trajectories)
- Trajectories collected by randomized, scripted policies (300K trajectories)
The algorithm assumes bounded total rewards and is designed to handle distribution shifts typically encountered in offline RL settings.
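The exact data format is not specified here; as one possible representation, each trajectory could be stored as a sequence of per-turn (observation, action, reward) records, with the bounded-total-reward assumption checked at load time. The class names below are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    observation: str   # prompt or environment observation at this turn
    action: str        # utterance or command the agent took in the dataset
    reward: float      # per-turn reward

@dataclass
class Trajectory:
    turns: List[Turn]

    def total_reward(self) -> float:
        # Q-SFT assumes the total reward of a trajectory is bounded.
        return sum(t.reward for t in self.turns)
```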
Outputs
The outputs of Q-SFT include:
- Q-values learned as probabilities, which provide a conservative estimate of the value function.
- A policy derived from the learned Q-values that selects actions to maximize expected cumulative reward.
- Performance metrics such as success rates and average scores across held-out instructions.
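As a sketch of how these outputs might be used at inference time, the snippet below scores a set of candidate actions by the probability a fine-tuned causal LM assigns to their tokens, treating those probabilities as conservative Q-value estimates, and acts greedily. It assumes a Hugging Face-style model/tokenizer interface and a hypothetical `candidate_actions` list; the paper's actual decoding procedure may differ.

```python
import torch

@torch.no_grad()
def greedy_action(model, tokenizer, prompt: str, candidate_actions: list[str]) -> str:
    """Pick the candidate action whose tokens receive the highest probability
    under the fine-tuned model (treated here as a conservative Q estimate)."""
    scores = []
    # Assumes tokenizing prompt + action preserves the prompt's token boundary.
    prompt_len = len(tokenizer(prompt)["input_ids"])
    for action in candidate_actions:
        inputs = tokenizer(prompt + action, return_tensors="pt")
        log_probs = torch.log_softmax(model(**inputs).logits[0], dim=-1)
        action_ids = inputs["input_ids"][0][prompt_len:]
        # Logits at position i predict the token at position i + 1.
        token_logps = log_probs[prompt_len - 1:-1].gather(1, action_ids.unsqueeze(1))
        scores.append(token_logps.sum().item())
    return candidate_actions[int(torch.tensor(scores).argmax())]
```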
Relationship to Other Methods
Q-SFT builds on several existing lines of work in reinforcement learning:
- Value-based RL (specifically Q-learning)
- Return-conditioned supervised learning (RCSL)
- Behavioral cloning
It is evaluated against state-of-the-art baselines such as ReAct, ILQL, and CQL, and aims to outperform them by providing a more stable and effective approach to fine-tuning LLMs for multi-turn tasks.
Techniques or Modules
Key techniques employed in Q-SFT include:
- Weighted cross-entropy loss: Replaces the unstable regression objective of standard Q-learning with a classification-style loss whose learned probabilities conservatively estimate the value function (see the sketch in the Architecture section and the target-weight sketch after this list).
- Q-learning: Estimates action values for an agent interacting with a Markov decision process (MDP), enabling effective policies for multi-step tasks.
- Behavioral cloning: Trains a policy directly from demonstration data and scales well to complex tasks.
- Offline RL: Learns from a fixed dataset without further environment interaction, suiting sequential decision-making tasks where online data collection is costly.
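To connect these pieces, here is a hedged sketch of how Bellman-style targets from a frozen target copy of the model could supply the weights used by the loss sketch in the Architecture section. Approximating V(s') by the maximum next-token probability under the target model is an assumption of this sketch, not necessarily the exact Q-SFT formulation, and the function name is illustrative.

```python
import torch

@torch.no_grad()
def bellman_weights(target_model, next_obs_ids: torch.Tensor,
                    rewards: torch.Tensor, dones: torch.Tensor,
                    gamma: float = 0.99) -> torch.Tensor:
    """Compute clipped targets r + gamma * V(s') for use as loss weights.

    next_obs_ids:   (batch, seq_len) token ids of the next observation
    rewards, dones: (batch,) per-transition reward and terminal flags (0.0 or 1.0)
    """
    logits = target_model(next_obs_ids).logits[:, -1, :]          # (batch, vocab)
    # V(s') approximated by the highest next-token probability (sketch assumption).
    next_values = torch.softmax(logits, dim=-1).max(dim=-1).values
    targets = rewards + gamma * (1.0 - dones) * next_values
    return targets.clamp(0.0, 1.0)  # keep weights in [0, 1], consistent with bounded rewards
```

Following standard Q-learning practice, the target model would be a periodically updated frozen copy of the policy model.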
Evaluation
Q-SFT has undergone comprehensive empirical evaluation across various tasks, demonstrating competitive performance against state-of-the-art methods. Key findings include:
- Q-SFT outperforms existing methods by significant margins in some cases, achieving nearly 30% better performance in specific evaluations.
- The method excels in low-data regimes, showcasing its robustness and efficiency.
Limitations and Open Questions
While Q-SFT marks a notable advance in offline reinforcement learning, potential limitations include:
- The method is designed for the offline setting and would require adaptation to support online RL optimization.
- The observation that Q-learning approaches may still be outperformed by supervised methods in certain contexts.
In summary, Q-SFT represents a significant step forward in the application of reinforcement learning techniques to large language models, addressing critical challenges in multi-turn task performance while maintaining stability and efficiency in training.