[Paper] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Published: November 26, 2025 at 07:25 AM EST
4 min read

Source: arXiv - 2511.21331v1

Overview

The paper introduces Contrastive Fusion (ConFu), a new framework for learning joint embeddings across any number of modalities (e.g., image, text, audio). Unlike most existing methods that only align pairs of modalities, ConFu simultaneously preserves pairwise relationships and captures higher‑order interactions (think “XOR‑style” dependencies) by treating fused modality combinations as first‑class citizens in the contrastive learning objective. The result is a single, unified embedding space that works well for both multimodal retrieval and single‑modality downstream tasks.

Key Contributions

  • Unified contrastive objective that jointly optimizes:
    1. Traditional pairwise modality alignment.
    2. A novel fused-modality contrastive term that aligns a fused combination of two (or more) modalities with a remaining modality.
  • Higher‑order dependency modeling: Demonstrates the ability to capture relationships that are invisible to pairwise alignment alone (e.g., XOR‑like patterns).
  • One‑size‑fits‑all retrieval: Supports both one‑to‑one (image ↔ text) and two‑to‑one (image + audio ↔ text) queries within the same training pipeline.
  • Extensive evaluation on synthetic benchmarks (to isolate higher‑order effects) and real‑world datasets (e.g., MS‑COCO, Flickr30K, AudioSet), showing competitive or superior performance on retrieval and classification.
  • Scalability analysis: Empirically shows that ConFu’s performance degrades gracefully as the number of modalities grows.

Methodology

  1. Backbone encoders – Each modality (image, text, audio, etc.) is processed by a modality‑specific encoder (ResNet, BERT, VGGish, …). The encoders are frozen or fine‑tuned depending on the experiment.
  2. Fusion module – For any subset of modalities, their embeddings are combined with a simple element-wise sum followed by a linear projection, yielding a fused representation with the same dimensionality as the individual embeddings (a minimal code sketch of this fusion step and the combined loss follows this list).
  3. Contrastive loss extension
    • Pairwise term: Classic InfoNCE loss that pulls together matching pairs (e.g., image ↔ caption) and pushes apart mismatched pairs.
    • Fused-modality term: Adds an extra contrastive objective that treats a fused embedding (e.g., image + audio) as an anchor and aligns it with the remaining modality (e.g., text). The loss is symmetric, so the remaining modality is likewise pulled toward the fused representation.
  4. Training loop – All terms are summed with a weighting hyper‑parameter λ. The model is trained end‑to‑end with stochastic gradient descent, using standard data augmentations per modality.
  5. Inference – Because every modality and every fused combination share the same embedding space, a single nearest‑neighbor search can answer any retrieval query (single‑modality or multimodal).
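
To make steps 2–4 concrete, here is a minimal PyTorch sketch of the fusion step and the combined objective. The three-modality setup, the function and variable names, the symmetric InfoNCE implementation, and the single fused combination (image + audio aligned with text) are our own assumptions for illustration; the paper's exact formulation, temperature, and weighting may differ.

```python
# Minimal sketch of ConFu-style training terms (our assumptions, see note above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SumFusion(nn.Module):
    """Element-wise sum of modality embeddings followed by a linear projection,
    keeping the fused vector in the same embedding space/dimensionality."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, *embeddings: torch.Tensor) -> torch.Tensor:
        fused = torch.stack(embeddings, dim=0).sum(dim=0)   # (batch, dim)
        return F.normalize(self.proj(fused), dim=-1)

def info_nce(anchor: torch.Tensor, target: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch: row i of `anchor` matches row i of `target`."""
    logits = anchor @ target.t() / temperature               # (batch, batch) similarities
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def confu_loss(img, txt, aud, fusion: SumFusion, lam: float = 1.0) -> torch.Tensor:
    """Pairwise alignment over all modality pairs plus one fused-modality term,
    weighted by the hyper-parameter lambda. Inputs are assumed L2-normalized."""
    pairwise = info_nce(img, txt) + info_nce(img, aud) + info_nce(txt, aud)
    fused_term = info_nce(fusion(img, aud), txt)             # fuse(image, audio) <-> text
    return pairwise + lam * fused_term

# Toy usage with random, already-normalized embeddings.
B, D = 32, 256
img, txt, aud = (F.normalize(torch.randn(B, D), dim=-1) for _ in range(3))
loss = confu_loss(img, txt, aud, SumFusion(D), lam=0.5)
```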

Results & Findings

| Dataset | Task | Metric (higher = better) | Baseline (pairwise) | ConFu |
| --- | --- | --- | --- | --- |
| MS-COCO (image-text) | 1-to-1 retrieval | Recall@1 | 45.2 % | 48.7 % |
| Flickr30K (image-text-audio) | 2-to-1 retrieval (image + audio → text) | Recall@5 | 31.8 % | 36.4 % |
| Synthetic XOR benchmark | Classification of XOR-type label | Accuracy | 62 % | 84 % |
| AudioSet (audio-video-text) | Multimodal classification | mAP | 21.5 | 24.3 |

  • Higher-order capture: On the synthetic XOR task, ConFu recovers the hidden relationship that pairwise models miss entirely (a toy illustration of why appears at the end of this section).
  • Unified retrieval: A single model handles both one‑to‑one and two‑to‑one queries without extra fine‑tuning.
  • Scalability: Adding a fourth modality (e.g., depth) only drops performance by ~2 % relative, confirming the method’s robustness.

Overall, ConFu matches or exceeds state‑of‑the‑art pairwise contrastive models while offering richer multimodal reasoning.
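
The XOR result is the clearest illustration of what a fused term adds: when a label depends on the exclusive-or of two modality factors, each modality alone is statistically independent of the label, so pairwise alignment has no signal to exploit, while a representation of the pair does. The toy construction below is our own illustration of this effect, not the paper's actual benchmark.

```python
# Toy XOR-style data: the label depends only on the *combination* of two
# modality "bits", so neither modality alone is predictive. This is our own
# illustration of a higher-order dependency, not the paper's benchmark.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
m1 = rng.integers(0, 2, n)           # modality-1 latent bit
m2 = rng.integers(0, 2, n)           # modality-2 latent bit
label = m1 ^ m2                      # XOR label: needs both modalities

# Each modality alone is (nearly) uncorrelated with the label ...
print(np.corrcoef(m1, label)[0, 1])  # ~0
print(np.corrcoef(m2, label)[0, 1])  # ~0
# ... but the fused pair predicts it perfectly.
print(((m1 ^ m2) == label).mean())   # 1.0
```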

Practical Implications

  • Search engines & recommendation systems – Developers can build a single index that answers queries like “show me images that match this caption and this short audio clip,” without training separate models for each query type (a minimal retrieval sketch follows this list).
  • Cross‑modal content creation tools – Tools that auto‑generate subtitles, captions, or soundtracks can leverage the higher‑order embeddings to ensure the generated modality respects the joint semantics of the others.
  • Edge‑friendly deployment – Because the fusion step is just a linear projection, the extra compute over a vanilla pairwise contrastive model is minimal, making it suitable for on‑device inference (e.g., AR glasses that combine vision and audio cues).
  • Data efficiency – By preserving pairwise alignment, ConFu still performs well when only a subset of modalities is present at test time, which is common in real‑world pipelines where some sensors may be missing.
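
As a sketch of the “single index” point above: because single-modality and fused embeddings share one space, a single cosine-similarity search serves every query type. The `embed` helper and the `fusion` module reused here are hypothetical placeholders tied to the methodology sketch, not an API defined by the paper.

```python
# One shared index answers one-to-one and two-to-one queries (our illustration).
# `embed(modality, x)` and `fusion` (a SumFusion instance) are hypothetical
# placeholders; all embeddings are assumed L2-normalized, so dot product = cosine.
import torch

def search(index_embeddings: torch.Tensor, query: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return indices of the top-k most similar indexed items for one query vector."""
    scores = index_embeddings @ query        # (num_items,) cosine similarities
    return scores.topk(k).indices

# image -> text retrieval:
#   search(text_index, embed("image", img))
# (image + audio) -> text retrieval, same index, no retraining:
#   search(text_index, fusion(embed("image", img), embed("audio", clip)))
```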

Limitations & Future Work

  • Fusion simplicity – The current element-wise sum + linear projection may not capture complex interactions for very heterogeneous modalities (e.g., video + 3-D point clouds). More expressive fusion (attention, cross-modal transformers) could boost performance; a sketch of one such alternative follows this list.
  • Training cost – Adding fused‑modality contrastive terms increases the number of negative samples, leading to higher memory usage for large batch sizes. Efficient negative mining strategies are an open avenue.
  • Limited modality count – Experiments stop at three–four modalities; scaling to dozens (e.g., sensor networks) may require hierarchical fusion or curriculum learning.
  • Theoretical analysis – While empirical results show higher‑order capture, a formal proof of what classes of functions ConFu can represent remains to be explored.
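
As one example of the “more expressive fusion” direction mentioned above, the sketch below swaps the sum-and-project step for multi-head self-attention over modality tokens. This is our own illustration of a possible extension, not something evaluated in the paper.

```python
# Attention-based fusion as a drop-in alternative to SumFusion (our illustration,
# not part of ConFu). Modality embeddings are treated as a short token sequence.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuses a variable set of modality embeddings with self-attention over
    modality tokens, then mean-pools and projects back to the shared space."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, *embeddings: torch.Tensor) -> torch.Tensor:
        tokens = torch.stack(embeddings, dim=1)          # (batch, n_modalities, dim)
        attended, _ = self.attn(tokens, tokens, tokens)  # modality tokens attend to each other
        return self.proj(attended.mean(dim=1))           # (batch, dim)
```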

Bottom line: Contrastive Fusion offers a pragmatic, developer‑friendly recipe for building multimodal systems that think beyond simple pairwise matches, opening the door to richer, context‑aware AI products.

Authors

  • Stefanos Koutoupis
  • Michaela Areti Zervou
  • Konstantinos Kontras
  • Maarten De Vos
  • Panagiotis Tsakalides
  • Grigorios Tsagatakis

Paper Information

  • arXiv ID: 2511.21331v1
  • Categories: cs.CV, cs.AI
  • Published: November 26, 2025
