GRPO-RM: Group Relative Policy Optimization for Representation Model

Overview

GRPO-RM, or Group Relative Policy Optimization for Representation Model, is a novel reinforcement learning approach designed for fine-tuning representation models. It aims to enhance model performance, particularly in image classification and semantic segmentation tasks, while reducing computational overhead compared to traditional methods like Proximal Policy Optimization (PPO). GRPO-RM addresses the challenges posed by imbalanced datasets and improves accuracy on out-of-distribution datasets.

Architecture

GRPO-RM builds upon the foundational principles of Group Relative Policy Optimization (GRPO) and incorporates a reinforcement post-training framework. The architecture includes:

  • A policy model (π_θ) and a reference model (π_ref).
  • Redesigned reward functions tailored to representation learning, which include accuracy rewards and uniformity rewards.
  • A training pipeline that involves building a post-training model, computing rewards, calculating advantages, and updating the model using an Adam optimizer (a minimal sketch of this setup follows the list).
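
The paper describes this pipeline at a high level; the PyTorch-style sketch below illustrates one way the components could be wired together, assuming the policy model π_θ is a frozen pretrained backbone with a trainable projection head and the reference model is a frozen copy of it. The function name, the stand-in backbone, and the learning rate are illustrative assumptions, not the authors' exact setup.

```python
import copy
import torch
import torch.nn as nn

def build_policy_model(pretrained_backbone: nn.Module,
                       feat_dim: int,
                       num_classes: int,
                       hidden_dim: int = 256) -> nn.Module:
    """Assemble the post-training model: frozen feature extractor + trainable head.

    Illustrative sketch only; not the authors' exact architecture.
    """
    # Freeze the pretrained representation model (feature extractor).
    for p in pretrained_backbone.parameters():
        p.requires_grad = False

    # Trainable projection head mapping features to class logits.
    head = nn.Sequential(
        nn.Linear(feat_dim, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, num_classes),
    )
    return nn.Sequential(pretrained_backbone, head)

# Stand-in backbone; in practice this would be a pretrained encoder.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
policy = build_policy_model(backbone, feat_dim=512, num_classes=10)   # pi_theta

reference = copy.deepcopy(policy).eval()                              # pi_ref (frozen)
for p in reference.parameters():
    p.requires_grad = False

# Only the projection head receives gradients.
optimizer = torch.optim.Adam(
    [p for p in policy.parameters() if p.requires_grad], lr=1e-3)
```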

Goals

The primary objectives of GRPO-RM are to:

  • Maximize the performance of representation models through reinforcement learning.
  • Achieve accelerated convergence and improved accuracy in visual representation tasks.
  • Provide a robust solution to the challenges of fine-tuning representation models, particularly in scenarios with imbalanced datasets.

Dataset Info

GRPO-RM supports various dataset types, including:

  • Image classification datasets
  • Semantic segmentation datasets

The model is particularly effective in handling out-of-distribution datasets, demonstrating significant improvements in performance metrics.

Outputs

The outputs of GRPO-RM are evaluated based on several metrics, including:

  • Softmax Regression (SR)
  • k-Nearest Neighbors (k-NN)
  • Pixel accuracy
  • Intersection over Union (IoU)
  • Mean IoU (a sketch of the segmentation metrics follows this list)
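
Pixel accuracy, IoU, and mean IoU have standard definitions; the sketch below computes them from a confusion matrix for concreteness. It is a generic reference implementation, not the evaluation code used in the paper.

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes):
    """Pixel accuracy, per-class IoU, and mean IoU from integer label maps.

    Generic reference implementation, not the paper's evaluation code.
    """
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(target.ravel(), pred.ravel()):
        conf[t, p] += 1

    pixel_acc = np.diag(conf).sum() / conf.sum()

    # IoU_c = TP_c / (TP_c + FP_c + FN_c) for each class c.
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)
    return pixel_acc, iou, iou.mean()

# Tiny example: 2x2 prediction versus ground truth with two classes.
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
acc, iou, miou = segmentation_metrics(pred, target, num_classes=2)
print(acc, iou, miou)   # 0.75, [0.5, 0.667], ~0.583
```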

The model has shown an average accuracy improvement of 4.26% on out-of-distribution datasets and outperforms standard fine-tuning methods across multiple benchmarks.

Key Contributions

GRPO-RM introduces several innovative contributions to the field:

  • It is the first reinforcement post-training method specifically designed for representation learning models.
  • The model employs accuracy and uniformity rewards to enhance overall performance.
  • It achieves significant improvements in segmentation tasks and classification metrics, demonstrating superior performance compared to standard fine-tuning approaches.

Relationship to Other Methods

GRPO-RM builds on established methods such as GRPO and PPO, while also relating closely to Direct Preference Optimization (DPO) and standard fine-tuning techniques. It replaces traditional cross-entropy loss with a more nuanced reward-based approach, leading to improved performance metrics.

Algorithm

At a high level, the GRPO-RM algorithm proceeds as follows:

  1. Build the post-training model (π_θ) with a frozen feature extractor.
  2. Iteratively update the model by computing rewards and advantages from the outputs of the reference model (π_θ_old).
  3. Optimize the model with respect to the defined objective (L).

The algorithm emphasizes the importance of accuracy and uniformity rewards in shaping the model's predictions.
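
The sketch below illustrates one GRPO-style update step under several assumptions that go beyond the summary above: the group is formed by sampling class predictions from the old policy's softmax, the accuracy reward is a 0/1 correctness signal, and the uniformity reward uses the common log-mean-exp pairwise potential on normalized outputs. The clipping threshold ε = 0.2 and β = 0 (no KL term) match the hyperparameters listed later; everything else is a hypothetical reading of the method, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def grpo_rm_step(policy, reference, optimizer, images, labels,
                 group_size=8, eps=0.2, uniformity_weight=0.1):
    """One illustrative GRPO-style update for a representation model.

    Hypothetical sketch: grouping, reward definitions, and weighting are
    assumptions, not the paper's exact formulation.
    """
    logits = policy(images)                      # current policy pi_theta
    with torch.no_grad():
        old_logits = reference(images)           # old/reference policy

    probs = F.softmax(logits, dim=-1)
    old_probs = F.softmax(old_logits, dim=-1)

    # Group of candidate class predictions per image, sampled from the old policy.
    samples = torch.multinomial(old_probs, group_size, replacement=True)   # (B, G)

    # Accuracy reward: 1 if a sampled class matches the ground-truth label.
    acc_reward = (samples == labels.unsqueeze(1)).float()                  # (B, G)

    # Uniformity reward: log-mean-exp pairwise potential on normalized outputs
    # (used here as a stand-in for a feature-level uniformity term).
    feats = F.normalize(logits.detach(), dim=-1)
    sq_dists = torch.cdist(feats, feats).pow(2)
    uniformity = -torch.log(torch.exp(-2.0 * sq_dists).mean())
    reward = acc_reward + uniformity_weight * uniformity                   # (B, G)

    # Group-relative advantage: normalize rewards within each group; treated
    # as a constant during backpropagation.
    adv = ((reward - reward.mean(dim=1, keepdim=True))
           / (reward.std(dim=1, keepdim=True) + 1e-8)).detach()

    # Clipped probability-ratio objective; beta = 0, so no KL penalty term.
    new_p = probs.gather(1, samples)
    old_p = old_probs.gather(1, samples)
    ratio = new_p / (old_p + 1e-8)
    loss = -torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (with the policy, reference, and optimizer built as in the
# Architecture sketch):
# images = torch.randn(16, 3, 32, 32)
# labels = torch.randint(0, 10, (16,))
# grpo_rm_step(policy, reference, optimizer, images, labels)
```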

Practicalities

Hyperparameters

Key hyperparameters for GRPO-RM include (collected in the configuration sketch after this list):

  • ε (the clipping threshold, set to 0.2 for GRPO-RM)
  • β (the KL-penalty coefficient, fixed to 0)
  • Hidden layer dimensions (256 for image classification, 64 for segmentation)
  • Batch sizes (1024 for classification, 256 for segmentation)
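
Collected into a configuration object, these values might look as follows; the dataclass and its field names are illustrative, and only the numeric values come from the list above.

```python
from dataclasses import dataclass

@dataclass
class GRPORMConfig:
    """Hyperparameters reported for GRPO-RM (field names are illustrative)."""
    eps: float = 0.2        # clipping threshold epsilon
    beta: float = 0.0       # KL-penalty coefficient, fixed to 0
    hidden_dim: int = 256   # 256 for image classification, 64 for segmentation
    batch_size: int = 1024  # 1024 for classification, 256 for segmentation

classification_cfg = GRPORMConfig()
segmentation_cfg = GRPORMConfig(hidden_dim=64, batch_size=256)
```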

Compute and Systems

Training GRPO-RM has the following practical requirements:

  • A pretrained model and projection head.
  • Access to high-performance GPUs (e.g., 4 NVIDIA A100 vGPUs or 2 NVIDIA H200 vGPUs) to handle the memory-intensive operations.

Limitations and Open Questions

While GRPO-RM demonstrates significant advancements, open questions remain around transferring an LLM-oriented method to representation learning:

  • Language models and representation models produce fundamentally different kinds of outputs (token sequences versus embeddings), which complicates a direct transfer of the original formulation.
  • GRPO's advantage computation relies on sampling token-level outputs, which does not map cleanly onto deterministic visual feature extraction.
  • The absence of a reference model limits the application of KL divergence in certain scenarios.

Evaluation

GRPO-RM has been rigorously evaluated across various datasets, including CIFAR-10, Tiny-ImageNet, and Pascal VOC. The evaluation settings focus on performance metrics that highlight the model's strengths in both classification and segmentation tasks, showcasing its ability to outperform standard fine-tuning methods.

In summary, GRPO-RM represents a significant step forward in the fine-tuning of representation models, leveraging reinforcement learning to achieve improved performance and faster convergence in challenging scenarios.

Sources

https://arxiv.org/abs/2511.15256v1