Contrastive Preference Optimization (CPO)
Overview
Contrastive Preference Optimization (CPO) is a training approach designed to push the performance of moderate-sized large language models (LLMs) in machine translation. By training on specially curated preference data, CPO addresses limitations of both supervised fine-tuning (SFT) and direct preference optimization (DPO), teaching models to prioritize higher-quality translations while rejecting adequate but imperfect ones.
Architecture
CPO introduces a training objective that goes beyond minimizing cross-entropy loss toward gold-reference translations. Like DPO, it contrasts a preferred and a dis-preferred output under a parameterized policy π_θ, but the reference model π_ref is approximated by a uniform prior U, so the reference terms cancel and no separate reference model needs to be stored or run. Preference labels come from reference-free evaluation models, allowing translation quality to be assessed without relying on potentially flawed gold references.
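In code, the objective can be sketched as a contrastive preference term plus a negative log-likelihood term on the preferred translation. The PyTorch snippet below is a minimal illustration rather than the authors' implementation; it assumes the sequence-level log-probabilities of the preferred and dis-preferred translations under π_θ have already been computed, and the function name and default β are illustrative.

```python
import torch
import torch.nn.functional as F

def cpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Minimal sketch of the CPO objective.

    Inputs are the summed sequence-level log-probabilities of the preferred
    (chosen) and dis-preferred (rejected) translations under the current
    policy pi_theta, one entry per example in the batch.
    """
    # Contrastive preference term: because the DPO reference model is replaced
    # by a uniform prior, its terms cancel and only pi_theta remains.
    prefer_loss = -F.logsigmoid(beta * (policy_chosen_logps - policy_rejected_logps))

    # NLL regularizer on the preferred translation, keeping the policy close
    # to the distribution of the preferred outputs.
    nll_loss = -policy_chosen_logps

    return (prefer_loss + nll_loss).mean()
```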
Goals
The primary goals of CPO include:
- Bridging the performance gap between moderate-sized LLMs (7B or 13B parameters) and state-of-the-art translation models.
- Training models to discern and prioritize high-quality translations while avoiding suboptimal outputs.
- Enhancing the overall performance of translation models while questioning the reliability of gold references used in traditional evaluation.
Dataset Information
CPO requires specific datasets to function effectively:
- Required Dataset Forms: 22K parallel sentences and triplet preference data.
- Supported Dataset Types: Preference data from the FLORES-200 dataset.
- Paired Preference Triplets: Constructed using FLORES-200 data, consisting of triplets (y_ref, y_gpt-4, y_alma) derived from 20K paired sentences across 10 translation directions.
- Preference Data Acquisition: Preferred and dis-preferred translations are selected by reference-free evaluation models, supplemented with 1K internally human-labeled preference examples (the pairing step is sketched after this list).
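One way to picture how a triplet becomes a training pair: score each candidate with a reference-free quality model and keep the highest- and lowest-scoring translations as the preferred and dis-preferred outputs. The sketch below is illustrative; the `score` callable stands in for an assumed averaged KIWI-XXL/XCOMET quality estimate and is not part of the original method's code.

```python
def build_preference_pair(source, candidates, score):
    """Turn one (y_ref, y_gpt-4, y_alma) triplet into a chosen/rejected pair.

    `candidates` is the list of the three candidate translations and
    `score(source, translation)` is a stand-in for an averaged reference-free
    quality estimate (e.g. from KIWI-XXL and XCOMET).
    """
    ranked = sorted(candidates, key=lambda y: score(source, y))
    y_l, y_w = ranked[0], ranked[-1]  # lowest score = rejected, highest = chosen
    return {"prompt": source, "chosen": y_w, "rejected": y_l}
```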
Outputs
CPO aims to produce superior translations by training models to refine details and improve overall translation quality. The expected effects include:
- Marked improvements in translation performance.
- Enhanced capabilities of models like ALMA, bringing their performance to a level comparable to, or surpassing, GPT-4 and WMT competition winners.
- Significant improvements across all translation directions.
Relationship to Other Methods
CPO builds upon existing methodologies, including:
- Supervised Fine-Tuning (SFT): Acknowledges the limitations of SFT, which caps model performance at the quality level of the training data.
- Direct Preference Optimization (DPO): Addresses DPO's memory and speed inefficiencies, which stem from keeping a frozen reference model in every forward pass (see the loss comparison after this list).
- CPO demonstrates significant performance improvements compared to both SFT and DPO, particularly in translation tasks.
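The relationship to DPO is easiest to see at the loss level. The following is a sketch in standard DPO/CPO notation (σ is the logistic function, β a temperature) rather than a verbatim reproduction of the paper's equations: setting the reference model to a uniform prior removes it from the objective, and an NLL term on the preferred translation is added.

```latex
% DPO keeps a frozen reference model \pi_{\mathrm{ref}} in the objective:
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log\sigma\!\left(
        \beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)\right]

% CPO approximates \pi_{\mathrm{ref}} with a uniform prior U, so it drops out,
% and adds an NLL term on the preferred translation y_w:
\mathcal{L}_{\mathrm{CPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log\sigma\!\big(
        \beta\log\pi_\theta(y_w \mid x) - \beta\log\pi_\theta(y_l \mid x)
      \big)\right]
    \;-\;\mathbb{E}_{(x,\,y_w)\sim\mathcal{D}}\big[\log\pi_\theta(y_w \mid x)\big]
```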
Techniques and Modules
CPO incorporates several key techniques:
- Contrastive Preference Optimization: Trains models to favor higher-quality translations and to avoid generating adequate but imperfect ones, mitigating the quality ceiling imposed by SFT.
- Manually Noised Data: Creates dis-preferred translations through random word deletions and swaps, following Zeng et al. (2023); a sketch of this noising step follows this list.
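A minimal sketch of such noising, assuming word-level deletions and adjacent swaps; the probabilities below are illustrative defaults, not values taken from Zeng et al. (2023).

```python
import random

def noise_translation(translation, delete_prob=0.15, swap_prob=0.15, seed=None):
    """Create a dis-preferred translation by randomly deleting words and
    swapping adjacent words. The probabilities are illustrative defaults."""
    rng = random.Random(seed)
    words = translation.split()

    # Random word deletions (always keep at least one word).
    kept = [w for w in words if rng.random() > delete_prob]
    words = kept or words[:1]

    # Random swaps of adjacent words.
    for i in range(len(words) - 1):
        if rng.random() < swap_prob:
            words[i], words[i + 1] = words[i + 1], words[i]

    return " ".join(words)
```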
Evaluation
CPO's effectiveness is evaluated using various settings and benchmarks:
- Evaluation Settings: WMT'21 and WMT'22 test sets, scored with reference-free evaluation models such as KIWI-XXL and XCOMET (a scoring sketch follows this list).
- Base Models Used: Evaluations involve models like ALMA, ALMA-13B-LoRA, and GPT-4.
- Headline Results: CPO leads to significant performance enhancements, with ALMA-13B-R achieving high scores on KIWI-XXL and XCOMET.
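Reference-free scoring of this kind can be approximated with the open-source COMET toolkit. The snippet below is a sketch that assumes the `unbabel-comet` package is installed and that access to the gated `Unbabel/wmt23-cometkiwi-da-xxl` (KIWI-XXL) checkpoint on the Hugging Face Hub has been granted.

```python
from comet import download_model, load_from_checkpoint

# KIWI-XXL is a reference-free (quality-estimation) metric: it scores a
# translation from the source and the hypothesis alone, with no gold reference.
model_path = download_model("Unbabel/wmt23-cometkiwi-da-xxl")
model = load_from_checkpoint(model_path)

data = [
    {"src": "Der Apfel fällt nicht weit vom Stamm.",
     "mt": "The apple does not fall far from the tree."},
]

# batch_size and gpus are illustrative settings for a single-GPU run.
output = model.predict(data, batch_size=8, gpus=1)
print(output.system_score)   # corpus-level score
print(output.scores)         # per-segment scores
```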
Limitations and Open Questions
While CPO demonstrates substantial improvements, it also faces challenges, such as:
- Although CPO avoids DPO's reference-model overhead, preference training still processes both a preferred and a dis-preferred sequence per example, so it remains less memory- and compute-efficient than plain SFT.
- Gains depend on the CPO step itself: without it, moderate-sized models may still slightly trail top competitors like GPT-4.
In summary, Contrastive Preference Optimization presents a robust framework for enhancing machine translation models, addressing key limitations of existing methods, and paving the way for future advancements in translation quality.