
Iterative Foundation Model Fine-Tuning on Multiple Rewards (MORLHF)

Overview

Iterative Foundation Model Fine-Tuning on Multiple Rewards (MORLHF), also referred to as Iterative Rewarded Soups (IterativeRS), is an approach for optimizing foundation models against multiple reward signals. It addresses the limitations of existing techniques that either rely on a single reward signal or fine-tune a separate model for each objective independently, both of which can lead to high performance variance and suboptimal results.

Architecture

MORLHF employs a multi-objective reinforcement learning framework that integrates an iterative fine-tuning strategy. This strategy involves merging expert policies to control variance and improve performance across multiple objectives. The architecture includes:

  • Policy Model: Denoted π_θ, the policy of a language model parameterized by θ.
  • Iterative Fine-Tuning: A process that breaks fine-tuning into smaller rounds and iteratively merges the resulting expert policies (see the sketch after this list).
  • Reward Combining: A method that applies reinforcement learning with combined rewards to optimize multi-objective fine-tuning.
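
Below is a minimal sketch of the fine-tune-then-merge loop, assuming rewarded-soups-style linear interpolation of expert weights. The toy Linear policy, the no-op finetune_with_rl placeholder, and the uniform merging weights alphas are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of iterative fine-tuning with expert-policy merging.
from copy import deepcopy

import torch


def merge_policies(expert_state_dicts, alphas):
    """Interpolate expert parameters in weight space: theta = sum_i alpha_i * theta_i."""
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(a * sd[name] for a, sd in zip(alphas, expert_state_dicts))
    return merged


def finetune_with_rl(policy, reward_fn):
    """Placeholder for single-reward RL fine-tuning (e.g. a PPO stage)."""
    # Intentionally a no-op so the sketch stays self-contained.
    return policy


def iterative_finetune(policy, reward_fns, n_rounds=3, alphas=None):
    alphas = alphas or [1.0 / len(reward_fns)] * len(reward_fns)
    for _ in range(n_rounds):
        # 1) Fine-tune one expert per reward, starting from the current merged policy.
        experts = []
        for reward_fn in reward_fns:
            expert = finetune_with_rl(deepcopy(policy), reward_fn)
            experts.append(expert.state_dict())
        # 2) Merge the experts in weight space and use the result as the next starting point.
        policy.load_state_dict(merge_policies(experts, alphas))
    return policy


if __name__ == "__main__":
    toy_policy = torch.nn.Linear(4, 2)  # stand-in for a language-model policy
    toy_rewards = [lambda out: out.mean(), lambda out: -out.abs().mean()]  # two toy objectives
    iterative_finetune(toy_policy, toy_rewards, n_rounds=2)
```

Merging after every round, rather than only once after training, is what distinguishes the iterative variant from purely post-hoc weight averaging.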

Goals

The primary goals of MORLHF include:

  • Optimizing foundation models using multiple reward signals.
  • Reducing performance variance across multiple objectives during fine-tuning.
  • Generating small molecules and DNA sequences with specific desirable properties.
  • Providing a robust framework for multi-objective reinforcement learning.

Dataset Info

MORLHF requires specific datasets for effective training:

  • Required Dataset Forms:
      • QM9 dataset
      • A labeled subset of 100,000 DNA sequences with corresponding activity measurements
  • Supported Dataset Types:
      • MPRA dataset for DNA sequences
      • Reddit Summary dataset for text summarization

The method also assumes that the distribution of data used to pre-train the foundation model aligns closely with the supervised training data for DNA sequences.
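
A sketch of how these datasets might be assembled is given below; the PyTorch Geometric QM9 loader, the CSV layout, and the column names for the DNA subset are assumptions rather than the paper's actual pipeline.

```python
# Assembling the required datasets (illustrative only).
import pandas as pd
from torch_geometric.datasets import QM9

# ~130k small molecules with quantum-chemical property labels.
qm9 = QM9(root="data/QM9")

# Hypothetical layout: one sequence column plus per-cell-line activity labels
# for the labeled subset of ~100,000 DNA sequences.
dna = pd.read_csv("data/mpra_subset.csv")
sequences = dna["sequence"].tolist()
activities = dna[["cell_line_1_activity", "cell_line_2_activity"]].to_numpy()
```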

Outputs

MORLHF generates outputs based on the fine-tuned models, which include:

  • Small molecules with optimal properties.
  • DNA sequences exhibiting maximal regulatory activity across specific cell lines.
  • Text summaries that meet specified evaluation criteria.

Evaluation

The evaluation of MORLHF involves various metrics and benchmarks:

  • Base Models Used: PAMNet, MolGPT-2, GPT-2 for DNA sequence generation, and Llama-3.2-3B-Instruct for text summarization.
  • Benchmark Metrics: Average reward, ICV score, ROUGE scores, and other task-specific metrics (a minimal average-reward computation is sketched after this list).
  • Headline Results: MORLHF demonstrates effectiveness across diverse tasks, achieving higher average rewards than baselines such as standard (non-iterative) MORLHF and Rewarded Soups.
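
A minimal sketch of the average-reward metric is shown below; the toy reward functions stand in for the paper's reward models, and the aggregation scheme (mean over samples and objectives) is an assumption.

```python
# Score each generation under every reward function, then average.
import numpy as np


def average_reward(samples, reward_fns):
    scores = np.array([[fn(s) for fn in reward_fns] for s in samples])  # (n_samples, n_objectives)
    return scores.mean(axis=0), scores.mean()  # per-objective means and overall average


if __name__ == "__main__":
    summaries = ["a short summary", "a slightly longer toy summary"]
    toy_rewards = [len, lambda s: s.count(" ") + 1]  # toy: character length and word count
    per_objective, overall = average_reward(summaries, toy_rewards)
    print(per_objective, overall)
```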

Relationship to Other Methods

MORLHF builds on existing techniques in reinforcement learning, particularly:

  • Reinforcement Learning with Human Feedback (RLHF)
  • Multi-Objective Reinforcement Learning

It is most closely related to standard MORLHF, Rewarded Soups, and Rewards-in-Context (RiC), and is claimed to outperform these methods in average reward and in consistency across objectives.

Techniques and Modules

Key techniques employed in MORLHF include:

  • Iterative Fine-Tuning: Controls variance among expert policies.
  • Reward Combining: Applies reinforcement learning to a combined reward signal for multi-objective fine-tuning (a minimal combining example follows this list).
  • Merging: Combines expert policies during and after training to enhance performance.
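
The sketch below illustrates one common form of reward combining, linear scalarization with preference weights; the exact combining rule used by MORLHF may differ, and the weights shown are purely illustrative.

```python
# Collapse per-objective rewards into one scalar per sample before the RL update.
import numpy as np


def combine_rewards(per_objective_rewards, weights):
    """per_objective_rewards: (batch, n_objectives); weights: (n_objectives,)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                               # normalize preferences to sum to 1
    return np.asarray(per_objective_rewards) @ w  # one scalar reward per sample


if __name__ == "__main__":
    batch = [[0.8, 0.2], [0.4, 0.9]]              # e.g. (helpfulness, brevity) per sample
    print(combine_rewards(batch, weights=[2.0, 1.0]))
```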

Limitations and Open Questions

Despite its advancements, MORLHF faces challenges, such as:

  • The theoretical analysis may not directly translate to improved performance in deployment scenarios.
  • Variance among expert policies can still impact the overall effectiveness of the model.

Conclusion

MORLHF represents a significant advancement in the field of multi-objective reinforcement learning, providing a flexible and robust framework for optimizing foundation models across various applications, including drug discovery and text generation. Its iterative approach to fine-tuning and merging expert policies offers a promising solution to the challenges posed by existing methodologies.

Sources

https://arxiv.org/abs/2511.00220v1