Contrastive Preference Optimization (CPO)
Overview
Contrastive Preference Optimization (CPO) is a training approach designed to push the performance of moderate-sized large language models (LLMs) in machine translation. By training on specially curated preference data, CPO addresses limitations of both supervised fine-tuning (SFT) and direct preference optimization (DPO), teaching models to prioritize higher-quality translations while rejecting adequate but imperfect ones.
Architecture
CPO introduces a training objective that goes beyond minimizing cross-entropy loss toward gold-reference translations. Like DPO, it contrasts a preferred and a dis-preferred output under a parameterized policy π_θ, but the reference model π_ref is approximated by a uniform prior U, so the reference terms cancel and no separate reference model needs to be stored or run. Preference labels come from reference-free evaluation models, allowing translation quality to be assessed without relying on potentially flawed gold references.
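In code, the objective can be sketched as a contrastive preference term plus a negative log-likelihood term on the preferred translation. The PyTorch snippet below is a minimal illustration rather than the authors' implementation; it assumes the sequence-level log-probabilities of the preferred and dis-preferred translations under π_θ have already been computed, and the function name and default β are illustrative.

```python
import torch
import torch.nn.functional as F

def cpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Minimal sketch of the CPO objective.

    Inputs are the summed sequence-level log-probabilities of the preferred
    (chosen) and dis-preferred (rejected) translations under the current
    policy pi_theta, one entry per example in the batch.
    """
    # Contrastive preference term: because the DPO reference model is replaced
    # by a uniform prior, its terms cancel and only pi_theta remains.
    prefer_loss = -F.logsigmoid(beta * (policy_chosen_logps - policy_rejected_logps))

    # NLL regularizer on the preferred translation, keeping the policy close
    # to the distribution of the preferred outputs.
    nll_loss = -policy_chosen_logps

    return (prefer_loss + nll_loss).mean()
```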
Goals
The primary goals of CPO include:
- Bridging the performance gap between moderate-sized LLMs (7B or 13B parameters) and state-of-the-art translation models.
- Training models to discern and prioritize high-quality translations while avoiding suboptimal outputs.
- Enhancing the overall performance of translation models while questioning the reliability of gold references used in traditional evaluation.
Dataset Information
CPO requires specific datasets to function effectively:
- Required Dataset Forms: 22K parallel sentences and triplet preference data.
- Supported Dataset Types: Preference data from the FLORES-200 dataset.
- Paired Preference Triplets: Constructed using FLORES-200 data, consisting of triplets (y_ref, y_gpt-4, y_alma) derived from 20K paired sentences across 10 translation directions.
- Preference Data Acquisition: Preferred and dis-preferred translations are selected by reference-free evaluation models, supplemented with 1K internally human-labeled preference examples (the pairing step is sketched after this list).
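One way to picture how a triplet becomes a training pair: score each candidate with a reference-free quality model and keep the highest- and lowest-scoring translations as the preferred and dis-preferred outputs. The sketch below is illustrative; the `score` callable stands in for an assumed averaged KIWI-XXL/XCOMET quality estimate and is not part of the original method's code.

```python
def build_preference_pair(source, candidates, score):
    """Turn one (y_ref, y_gpt-4, y_alma) triplet into a chosen/rejected pair.

    `candidates` is the list of the three candidate translations and
    `score(source, translation)` is a stand-in for an averaged reference-free
    quality estimate (e.g. from KIWI-XXL and XCOMET).
    """
    ranked = sorted(candidates, key=lambda y: score(source, y))
    y_l, y_w = ranked[0], ranked[-1]  # lowest score = rejected, highest = chosen
    return {"prompt": source, "chosen": y_w, "rejected": y_l}
```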
Outputs
CPO aims to produce superior translations by training models to refine details and improve overall translation quality. The expected effects include:
- Marked improvements in translation performance.
- Enhanced capabilities of models like ALMA, bringing their performance to a level comparable to, or surpassing, GPT-4 and WMT competition winners.
- Significant improvements across all translation directions.
Relationship to Other Methods
CPO builds upon existing methodologies, including:
- Supervised Fine-Tuning (SFT): Acknowledges the limitations of SFT, which caps model performance at the quality level of the training data.
- Direct Preference Optimization (DPO): Addresses DPO's memory and speed inefficiencies, which stem from keeping a frozen reference model in every forward pass (see the loss comparison after this list).
- CPO demonstrates significant performance improvements compared to both SFT and DPO, particularly in translation tasks.
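The relationship to DPO is easiest to see at the loss level. The following is a sketch in standard DPO/CPO notation (σ is the logistic function, β a temperature) rather than a verbatim reproduction of the paper's equations: setting the reference model to a uniform prior removes it from the objective, and an NLL term on the preferred translation is added.

```latex
% DPO keeps a frozen reference model \pi_{\mathrm{ref}} in the objective:
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log\sigma\!\left(
        \beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)\right]

% CPO approximates \pi_{\mathrm{ref}} with a uniform prior U, so it drops out,
% and adds an NLL term on the preferred translation y_w:
\mathcal{L}_{\mathrm{CPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log\sigma\!\big(
        \beta\log\pi_\theta(y_w \mid x) - \beta\log\pi_\theta(y_l \mid x)
      \big)\right]
    \;-\;\mathbb{E}_{(x,\,y_w)\sim\mathcal{D}}\big[\log\pi_\theta(y_w \mid x)\big]
```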
Techniques and Modules
CPO incorporates several key techniques:
- Contrastive Preference Optimization: Trains models to favor higher-quality translations and to avoid generating adequate but imperfect ones, mitigating the quality ceiling imposed by SFT.
- Manually Noised Data: Creates dis-preferred translations through random word deletions and swaps, following Zeng et al. (2023); a sketch of this noising step follows this list.
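A minimal sketch of such noising, assuming word-level deletions and adjacent swaps; the probabilities below are illustrative defaults, not values taken from Zeng et al. (2023).

```python
import random

def noise_translation(translation, delete_prob=0.15, swap_prob=0.15, seed=None):
    """Create a dis-preferred translation by randomly deleting words and
    swapping adjacent words. The probabilities are illustrative defaults."""
    rng = random.Random(seed)
    words = translation.split()

    # Random word deletions (always keep at least one word).
    kept = [w for w in words if rng.random() > delete_prob]
    words = kept or words[:1]

    # Random swaps of adjacent words.
    for i in range(len(words) - 1):
        if rng.random() < swap_prob:
            words[i], words[i + 1] = words[i + 1], words[i]

    return " ".join(words)
```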
Evaluation
CPO's effectiveness is evaluated using various settings and benchmarks:
- Evaluation Settings: WMT'21 and WMT'22 test sets, scored with reference-free evaluation models such as KIWI-XXL and XCOMET (a scoring sketch follows this list).
- Base Models Used: Evaluations involve models like ALMA, ALMA-13B-LoRA, and GPT-4.
- Headline Results: CPO leads to significant performance enhancements, with ALMA-13B-R achieving high scores on KIWI-XXL and XCOMET.
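Reference-free scoring of this kind can be approximated with the open-source COMET toolkit. The snippet below is a sketch that assumes the `unbabel-comet` package is installed and that access to the gated `Unbabel/wmt23-cometkiwi-da-xxl` (KIWI-XXL) checkpoint on the Hugging Face Hub has been granted.

```python
from comet import download_model, load_from_checkpoint

# KIWI-XXL is a reference-free (quality-estimation) metric: it scores a
# translation from the source and the hypothesis alone, with no gold reference.
model_path = download_model("Unbabel/wmt23-cometkiwi-da-xxl")
model = load_from_checkpoint(model_path)

data = [
    {"src": "Der Apfel fällt nicht weit vom Stamm.",
     "mt": "The apple does not fall far from the tree."},
]

# batch_size and gpus are illustrative settings for a single-GPU run.
output = model.predict(data, batch_size=8, gpus=1)
print(output.system_score)   # corpus-level score
print(output.scores)         # per-segment scores
```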
Limitations and Open Questions
While CPO demonstrates substantial improvements, it also faces challenges, such as:
- Although CPO avoids DPO's reference-model overhead, preference training still processes both a preferred and a dis-preferred sequence per example, so it remains less memory- and compute-efficient than plain SFT.
- Gains depend on the CPO step itself: without it, moderate-sized models may still slightly trail top competitors like GPT-4.
In summary, Contrastive Preference Optimization presents a robust framework for enhancing machine translation models, addressing key limitations of existing methods, and paving the way for future advancements in translation quality.