Orthogonal Finetuning for Direct Preference Optimization (RoPO)
Overview
RoPO, proposed in the paper "Orthogonal Finetuning for Direct Preference Optimization," is a fine-tuning method designed to enhance the performance of preference optimization. It specifically addresses the overfitting issues prevalent in models tuned with Direct Preference Optimization (DPO). By constraining weight updates to rotations and magnitude stretches, RoPO improves the diversity of generated content while maintaining alignment performance.
Architecture
RoPO builds on several foundational models and techniques, including Direct Preference Optimization (DPO), the Bradley-Terry (BT) preference model, and parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA). Rather than updating weights freely, RoPO applies a combination of rotational and magnitude-stretching updates to the weight parameters, preserving the knowledge encoded in the angles between neurons. A sketch of this update form is given below.
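To make the constrained update concrete, here is a minimal sketch of a rotation-plus-stretch reparameterization of a frozen weight matrix. It assumes neurons are stored as columns and that the magnitude factors act per neuron; the function name `ropo_style_update` and the tensor shapes are illustrative, not the paper's exact parameterization.

```python
import torch

def ropo_style_update(W0: torch.Tensor, R: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """Reparameterize a frozen weight with a shared rotation and per-neuron stretch.

    W0: (d_in, d_out) pretrained weight, one neuron per column (kept frozen).
    R:  (d_in, d_in) trainable orthogonal matrix shared across all neurons.
    m:  (d_out,) trainable per-neuron magnitude factors.

    Rotating every neuron by the same orthogonal R preserves all pairwise
    angles between neurons (and hence the hyperspherical energy); the
    magnitude factors only rescale each neuron's length.
    """
    # Numerical sanity check that R is orthogonal.
    assert torch.allclose(R.T @ R, torch.eye(R.shape[0]), atol=1e-5)
    return (R @ W0) * m  # broadcast m over the neuron (column) dimension

# Identity rotation and unit magnitudes recover the pretrained weight exactly.
d_in, d_out = 8, 4
W0 = torch.randn(d_in, d_out)
assert torch.allclose(ropo_style_update(W0, torch.eye(d_in), torch.ones(d_out)), W0)
```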
Key Components
- Weight Regularization: Maintains hyperspherical energy invariance during weight updates to prevent overfitting.
- Givens Rotation: Constructs orthogonal matrices by composing rotations that each act on a 2-dimensional subspace, keeping the update efficient (see the sketch after this list).
- Bidirectional Integrated Givens (BIG) Matrices: Combines forward and reverse Givens rotation matrices to enhance performance.
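The following sketch shows how an orthogonal matrix can be assembled from 2-dimensional Givens rotations. The chosen angles, coordinate pairs, and the helper names `givens_rotation` and `compose_givens` are illustrative; the paper's BIG construction composes forward and reverse rotations in a specific structured pattern that is not reproduced here.

```python
import torch

def givens_rotation(n: int, i: int, j: int, theta: float) -> torch.Tensor:
    """An n x n Givens rotation: the identity except for a 2-D rotation in plane (i, j)."""
    G = torch.eye(n)
    c, s = torch.cos(torch.tensor(theta)), torch.sin(torch.tensor(theta))
    G[i, i], G[j, j] = c, c
    G[i, j], G[j, i] = -s, s
    return G

def compose_givens(n: int, plane_angles: list[tuple[int, int, float]]) -> torch.Tensor:
    """Compose several Givens rotations into a single orthogonal matrix."""
    R = torch.eye(n)
    for i, j, theta in plane_angles:
        R = givens_rotation(n, i, j, theta) @ R
    return R

# A product of Givens rotations is always orthogonal: R^T R = I.
R = compose_givens(4, [(0, 1, 0.3), (2, 3, -1.2), (1, 2, 0.7)])
assert torch.allclose(R.T @ R, torch.eye(4), atol=1e-6)
```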
Goals
The primary objectives of RoPO include:
- Curbing overfitting on dispreferred samples in DPO-tuned models.
- Enhancing the diversity of generated content.
- Improving alignment performance without sacrificing expressive capacity.
Dataset Information
RoPO is trained and evaluated with the following datasets:
- UltraChat-200k: Used for instruction fine-tuning.
- UltraFeedback: Used for preference optimization.
These datasets provide the (prompt, preferred response, dispreferred response) triplets that carry the binary preference signal needed to train the model effectively; a sketch of this layout follows.
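As an illustration of the expected data layout, the sketch below builds one preference triplet by hand. The field names (`prompt`, `chosen`, `rejected`) follow a common convention for DPO-style training data and are an assumption here, not a format prescribed by the paper.

```python
# A single preference example as typically consumed by DPO-style trainers.
# Field names are an assumed convention, not prescribed by the paper.
preference_example = {
    "prompt": "Explain why the sky is blue in one sentence.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter the most.",
    "rejected": "The sky is blue because the ocean reflects onto it.",
}

def as_triplet(example: dict) -> tuple[str, str, str]:
    """Unpack a preference example into the (prompt, preferred, dispreferred) triplet."""
    return example["prompt"], example["chosen"], example["rejected"]

x, y_w, y_l = as_triplet(preference_example)
```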
Outputs
RoPO aims to produce outputs that:
- Maximize the likelihood of preferred responses while minimizing the likelihood of dispreferred ones, relative to a frozen reference model (see the sketch after this list).
- Maintain a balance between alignment performance and generation diversity, resulting in high-quality outputs across various tasks.
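For reference, below is a minimal sketch of the standard DPO objective that this preferred-vs-dispreferred trade-off corresponds to, written here as a plain loss function. It assumes per-sequence log-probabilities have already been computed for both the policy and the frozen reference model; the `beta` value is a placeholder, and the exact objective RoPO trains under its constrained parameterization may differ in detail.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probabilities of the
    chosen / rejected response under the policy or the frozen reference model.
    The loss widens the margin between chosen and rejected responses,
    measured relative to the reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```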
Evaluation
RoPO has been evaluated across multiple benchmarks, demonstrating outstanding performance:
- MT-Bench: RoPO outperformed DPO by up to 10 points.
- AlpacaEval 2: Achieved a performance improvement of up to 2.8 points over DPO.
- Commonsense Reasoning Tasks: Maintained strong performance, particularly in QA scenarios.
Evaluation Metrics
- Length Weighted Win Rate (WWR)
- Average Generation Length
- Win Rate (WR)
- Reward Model Score (RMS)
- Distinct N-grams (generation diversity; see the sketch after this list)
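For the diversity metric, here is a minimal sketch of distinct-n, the ratio of unique n-grams to total n-grams across a set of generations. The whitespace tokenization and the helper name `distinct_n` are simplifications; the paper may tokenize differently.

```python
def distinct_n(texts: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across a set of generations.

    Higher values indicate more diverse outputs.
    """
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Example: two identical generations give a low distinct-2 score.
print(distinct_n(["the cat sat on the mat", "the cat sat on the mat"]))  # 0.5
```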
Limitations and Open Questions
While RoPO shows significant advancements in addressing overfitting and enhancing diversity, ongoing research is needed to explore its limitations and potential improvements. Further investigation into the robustness of its performance across different datasets and tasks will be essential.
Practicalities
RoPO is designed to be parameter-efficient, requiring only 0.0086% of the trainable parameters used by standard DPO fine-tuning. This efficiency translates into lower memory usage during training and faster training; the snippet below shows how such a fraction is computed.
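To put that figure in context, the sketch below shows how a trainable-parameter fraction is typically computed for a PyTorch model. The toy model, the frozen layer, and the helper name `trainable_fraction` are illustrative; the printed number depends entirely on which tensors are marked trainable and does not reproduce the paper's 0.0086% figure.

```python
import torch.nn as nn

def trainable_fraction(model: nn.Module) -> float:
    """Fraction of model parameters that will receive gradient updates."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total

# Toy example: freeze everything except a small adapter-style layer.
model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 8))
for p in model[0].parameters():
    p.requires_grad = False
print(f"{100 * trainable_fraction(model):.4f}% trainable")
```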
Conclusion
RoPO represents a significant step forward in the field of preference optimization, effectively balancing alignment and diversity in generated content. Its innovative approach to weight regularization and architecture positions it as a leading method in the ongoing development of AI models for preference optimization tasks.