Generalized Contrastive Learning (GCL)

Overview

Generalized Contrastive Learning (GCL) is a training approach for improving multimodal retrieval. It addresses the limitations of existing methods by integrating multiple modalities, such as text, images, and fused text-image representations, within a unified contrastive framework, and it aims to improve retrieval across diverse tasks and datasets without requiring the curation of new task-specific datasets.

Architecture

GCL enforces contrastive learning across all modalities within a mini-batch, constructing a unified representation space from text embeddings, image embeddings, and fused text-image embeddings. Its loss function leverages existing image-caption paired datasets, mitigating the modality gap that arises in multimodal retrieval.
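
To make the unified space concrete, here is a minimal sketch, assuming simple linear stand-ins for the text, image, and fusion encoders. The real model fine-tunes an existing cross-modal retrieval backbone; all names and dimensions below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the text tower, image tower, and fusion module.
text_encoder = torch.nn.Linear(512, 256)
image_encoder = torch.nn.Linear(768, 256)
fusion = torch.nn.Linear(256 + 256, 256)

def embed_all(text_feats, image_feats):
    """Map text, image, and fused text-image inputs into one unit-norm space."""
    t = F.normalize(text_encoder(text_feats), dim=-1)
    v = F.normalize(image_encoder(image_feats), dim=-1)
    f = F.normalize(fusion(torch.cat([t, v], dim=-1)), dim=-1)
    return {"text": t, "image": v, "fused": f}  # all share one embedding space
```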

Goals

The primary objectives of GCL include:

  • Enhancing multimodal retrieval performance.
  • Improving retrieval effectiveness across diverse tasks and datasets.
  • Overcoming the limitations of existing methods that rely on meticulously curated datasets and fail to generalize to unseen modality combinations.

Dataset Info

GCL primarily utilizes existing image-caption paired datasets, which are essential for its training and evaluation. Notable datasets include:

  • LCS-558K dataset for general-purpose fine-tuning.
  • M-BEIR, MMEB, CoVR, MSCOCO, and others for benchmarking performance.

Outputs

GCL improves retrieval performance across various benchmarks, as measured by:

  • Recall@1, Recall@5, Recall@10, and Recall@50 (a sketch of the computation follows this list).
  • Higher cosine similarity between query and ground-truth representations.
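
Recall@K is the fraction of queries whose ground-truth item appears among the top-K retrieved candidates. A minimal sketch of the computation, assuming one relevant gallery item per query and cosine-similarity ranking:

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, gt_index, k):
    """Fraction of queries whose ground-truth item is ranked in the top-k.

    query_emb: (Q, D); gallery_emb: (G, D); gt_index: (Q,) long tensor
    giving each query's single relevant gallery row (an assumption here).
    """
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(gallery_emb, dim=-1).T
    topk = sims.topk(k, dim=-1).indices               # (Q, k) candidate ids
    hits = (topk == gt_index.unsqueeze(-1)).any(-1)   # ground truth in top-k?
    return hits.float().mean().item()
```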

Relationship to Other Methods

GCL builds on existing frameworks such as VISTA and AlignCLIP but avoids generating new datasets for specific retrieval scenarios. It outperforms models trained on newly composed triplet datasets across multimodal retrieval tasks.

Objectives and Losses

GCL employs a variety of loss functions, including:

  • Generalized Contrastive Learning (GCL) loss function.
  • Contrastive loss.
  • Intra-modality separation loss.
  • Cross-modal alignment terms.

These losses are designed to optimize multimodal retrieval performance by ensuring effective alignment between different modalities.
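
Schematically, the combined objective can be written as a weighted sum of these terms. The weights and the InfoNCE form of the contrastive term below are illustrative assumptions, not the paper's exact formulation:

```latex
\mathcal{L}_{\mathrm{GCL}}
  = \mathcal{L}_{\mathrm{contrastive}}
  + \lambda_{\mathrm{sep}}\,\mathcal{L}_{\mathrm{intra\text{-}sep}}
  + \lambda_{\mathrm{align}}\,\mathcal{L}_{\mathrm{align}},
\qquad
\mathcal{L}_{\mathrm{contrastive}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \frac{\exp(\mathrm{sim}(q_i, p_i)/\tau)}
              {\sum_{j=1}^{N} \exp(\mathrm{sim}(q_i, p_j)/\tau)}
```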

Algorithm

GCL applies a contrastive loss across all modalities within a mini-batch, fine-tuning cross-modal retrieval models with its unified objective. Negative samples are constructed from all possible combinations of embeddings in the batch, which strengthens the model's ability to learn a unified representation space.
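
A minimal sketch of this step, assuming unit-normalized per-modality embeddings and a cross-entropy (InfoNCE-style) loss applied to every ordered pair of modality views. The pair enumeration and temperature are illustrative assumptions, not the paper's exact procedure:

```python
import itertools
import torch
import torch.nn.functional as F

def gcl_batch_loss(views, temperature=0.07):
    """Contrastive loss over all modality combinations in a mini-batch.

    views: dict of unit-norm embeddings, e.g. {"text": (B, D),
    "image": (B, D), "fused": (B, D)}. Row i of every view describes
    the same underlying item, so matching indices are positives and
    every other row in the batch serves as a negative.
    """
    losses = []
    targets = None
    for a, b in itertools.permutations(views.keys(), 2):
        logits = views[a] @ views[b].T / temperature   # (B, B) similarities
        if targets is None:
            targets = torch.arange(logits.size(0), device=logits.device)
        losses.append(F.cross_entropy(logits, targets))  # diagonal = positives
    return torch.stack(losses).mean()
```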

Techniques or Modules

Key techniques and modules within GCL include:

  • Generalized Contrastive Learning: Enhances multimodal retrieval performance by integrating various modalities.
  • Intra-modality Separation Loss: Mitigates the modality gap, improving retrieval performance (see the sketch after this list).
  • Contrastive Learning: Utilizes pairwise preferences to enhance retrieval models.
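
One plausible form of the intra-modality separation term penalizes high similarity between distinct items of the same modality. This uniformity-style version is an assumption for illustration, not the paper's exact definition:

```python
import torch

def intra_modality_separation(emb, temperature=0.07):
    """Push apart distinct items that share a modality (illustrative form).

    emb: (B, D) unit-norm embeddings from a single modality. Penalizing
    high off-diagonal similarity spreads same-modality items apart,
    which is one way to reduce the modality gap.
    """
    sims = emb @ emb.T / temperature                  # (B, B) similarities
    off_diag = ~torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return torch.logsumexp(sims[off_diag], dim=0)     # lower when spread out
```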

Evaluation

GCL is evaluated in both global and local retrieval settings. Benchmarked on multiple datasets against multiple baseline models, it demonstrates significant improvements in retrieval performance. Notable results include:

  • GCL (Ours) + Pairwise achieving 37.52 Recall@1 on VISTA.
  • Consistent performance improvements across diverse tasks.

Limitations and Open Questions

While GCL shows promising results, its gains may be limited for tasks where query and target share the same modality, and performance drops are observed on certain tasks. Future work may explore integration with multimodal large language models (MLLMs) to further extend its capabilities.

Conclusion

Generalized Contrastive Learning represents a significant advancement in the field of multimodal retrieval, offering a scalable and effective solution that leverages existing datasets while improving performance across diverse scenarios.

Sources

https://arxiv.org/abs/2509.25638v1