Self-NPO: Negative Preference Optimization Model
Overview
Self-NPO is a data-free approach to Negative Preference Optimization (NPO) for diffusion models, implemented through truncated diffusion fine-tuning. It is designed to enhance generative quality without relying on explicit preference annotations, addressing two practical issues in generative modeling: suppressing undesired outputs and reducing the computational cost of fine-tuning diffusion models.
Architecture
The architecture of Self-NPO builds upon existing methods such as Reinforcement Learning with Human Feedback (RLHF) and Classifier-Free Guidance (CFG). It introduces a data-free approach for negative preference optimization through truncated diffusion fine-tuning, allowing the model to learn directly from self-generated data.
Key Techniques
- Truncated Diffusion Fine-Tuning: A method that updates the model using partially generated diffusion samples, reducing the computational burden of full diffusion simulations (see the sketch after this list).
- Controlled Weakening of Generative Ability: This technique reduces undesired outputs by strategically weakening the generative capabilities of the model.
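The following is a minimal sketch of the truncation step, assuming a generic epsilon-prediction diffusion model with a deterministic DDIM-style update; the interface `model(x, t, cond)`, the noise schedule, and the function names are illustrative assumptions, not the paper's implementation.

```python
import torch

def make_alphas_cumprod(num_steps: int) -> torch.Tensor:
    """Illustrative linear-beta noise schedule (not the paper's exact schedule)."""
    betas = torch.linspace(1e-4, 2e-2, num_steps)
    return torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def truncated_sampling(model, x, cond, alphas_cumprod, timesteps, truncate_at):
    """Run the iterative denoising loop only part-way and keep the partial sample.

    `model(x, t, cond)` is assumed to predict the noise epsilon; the update below
    is a standard deterministic DDIM step.
    """
    for i in range(len(timesteps) - 1):
        if i >= truncate_at:                       # stop early instead of denoising fully
            return x, timesteps[i]
        t, t_prev = timesteps[i], timesteps[i + 1]
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        eps = model(x, t, cond)                    # epsilon prediction at step t
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps
    return x, timesteps[-1]
```

Intuitively, stopping the sampler early yields rougher samples, which is what the controlled weakening of generative ability relies on.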
Goals
The primary objectives of Self-NPO include:
- To apply negative preference optimization to pretrained models.
- To maximize the expected value of the associated reward model while maintaining proximity to a reference distribution (formalized below this list).
- To improve training efficiency and reduce the need for manual data labeling or reward model training.
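Written out, the second goal corresponds to the familiar reward-maximization objective with a KL penalty toward the reference model; the regularization weight $\beta$ below is a generic hyperparameter introduced here for illustration, not a value from the paper.

$$
\max_{\theta} \; \mathbb{E}_{c,\; x_0 \sim P_\theta(\cdot \mid c)}\big[ R(x_0, c) \big] \;-\; \beta\, D_{\mathrm{KL}}\!\big( P_\theta(x_0 \mid c) \,\big\|\, P_{\text{ref}}(x_0 \mid c) \big)
$$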
Dataset Information
Self-NPO leverages self-generated data, eliminating the need for explicit preference annotations. This approach allows the model to learn directly from the outputs of the diffusion model itself, thereby streamlining the training process.
Outputs
The model aims to enhance the aesthetic quality of generated images, improving high-frequency detail, color, lighting, and overall composition. The expected outputs align more closely with human preferences, leading to a more satisfactory generative experience.
Relationship to Other Methods
Self-NPO builds on established methods such as:
- Reinforcement Learning with Human Feedback (RLHF)
- Classifier-Free Guidance (CFG)
It is most closely related to Diffusion-NPO (negative preference optimization for diffusion models) and to Direct Preference Optimization (DPO). Notably, Self-NPO achieves performance comparable to Diffusion-NPO while being entirely data-free.
Core Objects and Definitions
- Reference Model: $P_{\text{ref}}(x_0 \mid c)$
- Reward Definition: $R(x, c) \in [0, 1]$, representing a reward model like HPSv2.
- Key Equations:
  - $\epsilon_\omega = (\omega + 1)\,\epsilon_{\theta_{\text{pos}}}(x_t, t, c) - \omega\,\epsilon_{\theta_{\text{neg}}}(x_t, t, c')$
  - $\nabla_{x_t} \log P_\theta(x_t \mid c; t) = -\dfrac{\epsilon_\theta(x_t, t, c)}{\sigma_t}$
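As a concrete illustration of the first key equation, the sketch below combines a positive predictor with the NPO-weakened negative predictor at guidance scale $\omega$; the function name is hypothetical.

```python
import torch

def npo_guided_epsilon(eps_pos: torch.Tensor,
                       eps_neg: torch.Tensor,
                       omega: float) -> torch.Tensor:
    """CFG-style combination: (omega + 1) * eps_pos - omega * eps_neg.

    eps_pos comes from the base (or positively preference-tuned) model and
    eps_neg from the NPO-weakened model, so the guidance direction pushes
    samples away from dispreferred outputs.
    """
    return (omega + 1.0) * eps_pos - omega * eps_neg
```

At inference time this plays the role of the negative branch in classifier-free guidance, so the guidance direction points away from what the weakened model prefers.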
Objectives and Losses
The primary objective of Self-NPO is to maximize the expected reward while keeping the model close to a reference distribution. Training relies on the standard diffusion loss, from which the learning objective for NPO is derived.
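For reference, the standard epsilon-prediction diffusion loss that such derivations start from can be written as follows (a standard formulation; the paper's exact notation may differ):

$$
\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{(x_0, c),\, t,\, \epsilon \sim \mathcal{N}(0, I)}\!\left[ \big\| \epsilon - \epsilon_\theta(x_t, t, c) \big\|_2^2 \right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon .
$$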
Algorithm
Self-NPO employs truncated diffusion fine-tuning for negative preference optimization, adapting existing methods without requiring new datasets or sampling strategies. The training pipeline fine-tunes the model on the results of partially executed iterative diffusion generation, which improves efficiency.
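Below is a minimal sketch of one training step under the same assumptions as the partial-sampling helper above. How exactly the partial sample enters the standard diffusion loss is an assumption here (it is treated as a pseudo clean image and re-noised), and names such as `self_npo_step` and the optimizer setup are illustrative.

```python
import torch

def self_npo_step(model, optimizer, x_T, cond,
                  alphas_cumprod, timesteps, truncate_at):
    """One fine-tuning step on a self-generated, partially denoised sample."""
    # 1) Self-generate a partial sample (no gradients flow through sampling).
    x_partial, _ = truncated_sampling(model, x_T, cond,
                                      alphas_cumprod, timesteps, truncate_at)

    # 2) Re-noise it at a random timestep and apply the standard diffusion loss.
    t = torch.randint(0, len(alphas_cumprod), (1,)).item()
    a_t = alphas_cumprod[t]
    noise = torch.randn_like(x_partial)
    x_t = a_t.sqrt() * x_partial + (1.0 - a_t).sqrt() * noise
    loss = torch.mean((model(x_t, t, cond) - noise) ** 2)

    # 3) Gradient update on the (now negatively biased) model.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```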
Practicalities
Self-NPO is designed to be computationally efficient, with training costs significantly lower than those of traditional methods. For example, Self-NPO trained for 1,000 iterations took approximately 0.5 hours on 4 A800 GPUs, whereas the baseline method required about 2.6 hours (roughly a 5x reduction).
Evaluation
The model has been evaluated with various base models, including SD1.5 and SDXL, on datasets such as Pick-a-Pic. Key evaluation metrics include PickScore, HPSv2.1, ImageReward, and LAION Aesthetics. Self-NPO consistently improves generation quality and alignment with human preferences, although its performance ceiling may be lower than that of annotation-based NPO.
Limitations and Open Questions
While Self-NPO offers clear practical advantages, its limitations remain to be characterized: in particular, the evaluation above suggests its achievable quality may be bounded below that of annotation-based NPO, and closing this gap is an open question for future research.
This overview summarizes the Self-NPO model's architecture, goals, and methodology, highlighting its contributions to generative modeling.