[Paper] SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport
Source: arXiv - 2602.23353v1
Overview
The paper SOTAlign tackles a practical problem: how to fuse powerful, frozen vision and language models into a common embedding space without needing millions of paired image‑text examples. By introducing a semi‑supervised framework that leverages just a handful of paired samples plus abundant unpaired data, the authors demonstrate that high‑quality cross‑modal alignment is possible – a step toward more data‑efficient multimodal AI systems.
Key Contributions
- Semi‑supervised alignment paradigm – formalizes training with few image‑text pairs plus large pools of unpaired images and texts.
- Two‑stage SOTAlign pipeline:
  - Coarse geometry recovery using a linear “teacher” network trained on the limited paired set.
  - Fine‑grained refinement via an optimal‑transport (OT) divergence that transfers relational structure from unpaired data without forcing a strict one‑to‑one mapping.
- Empirical superiority – outperforms both fully supervised contrastive baselines and prior semi‑supervised methods across multiple vision–language encoder combos and datasets.
- Modality‑agnostic design – works with any frozen unimodal encoder (e.g., CLIP‑ViT, BLIP‑ViT, BERT, RoBERTa) without re‑training the backbone.
Methodology
- Setup – Two frozen encoders, \(f_{\text{img}}\) and \(f_{\text{txt}}\), map images and texts to high‑dimensional vectors. The goal is to learn lightweight alignment layers \(A_{\text{img}}\) and \(A_{\text{txt}}\) such that the transformed embeddings lie in a shared space.
- Stage 1: Linear Teacher
  - Using the few paired samples \(\{(x_i, y_i)\}\), a simple linear mapping \(T\) is trained to minimize a contrastive loss.
  - This step captures a rough global alignment (i.e., the overall orientation and scale) and provides a “teacher” distribution over the joint space.
- Stage 2: Optimal‑Transport Refinement
  - For the massive unpaired pools \(\{x\}\) and \(\{y\}\), the method builds pairwise similarity graphs within each modality (e.g., cosine similarity between image embeddings).
  - An OT divergence measures how well the relational structure of the image graph can be transported onto the text graph after alignment.
  - The loss encourages the aligned embeddings to preserve relative distances (i.e., “if two images are similar, their corresponding texts should also be similar”), while allowing flexibility in absolute positioning.
  - The alignment layers are updated via gradient descent on this OT‑based objective, effectively “shaping” the joint space using the abundant unpaired data.
- Training Loop – The two stages can be run sequentially or iteratively; the authors report that a single pass (teacher → OT refinement) works best in practice.
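Under simplifying assumptions, the two-stage idea can be sketched in a few lines of numpy. This toy demo stands in for the frozen encoders with random linear projections of a shared latent variable, fits the Stage 1 linear teacher by least squares (a deliberate simplification of the paper's contrastive objective), and evaluates a Stage 2-style relational loss that compares intra-modal cosine-similarity graphs on the unpaired pool. All names, dimensions, and sample counts are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the frozen encoders: both modalities are random linear
# projections of a shared latent variable, so they are genuinely related.
d_latent, d_img, d_txt = 16, 32, 24
n_pairs, n_unpaired = 100, 200
latent = rng.normal(size=(n_pairs + n_unpaired, d_latent))
P_img = rng.normal(size=(d_latent, d_img))
P_txt = rng.normal(size=(d_latent, d_txt))
img_emb = latent @ P_img + 0.05 * rng.normal(size=(n_pairs + n_unpaired, d_img))
txt_emb = latent @ P_txt + 0.05 * rng.normal(size=(n_pairs + n_unpaired, d_txt))

# Stage 1: linear "teacher" fitted on the few paired samples.
# Least squares is a stand-in here for the paper's contrastive loss.
X, Y = img_emb[:n_pairs], txt_emb[:n_pairs]
T, *_ = np.linalg.lstsq(X, Y, rcond=None)  # maps image space -> text space

def cosine_sim(Z):
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return Zn @ Zn.T

def relational_loss(A, B):
    """Mean squared gap between the intra-modal similarity graphs --
    the relational discrepancy the OT refinement stage drives down."""
    return float(np.mean((cosine_sim(A) - cosine_sim(B)) ** 2))

# Stage 2 diagnostic on the unpaired pool: after the teacher map, the image
# similarity graph should match the text graph far better than a random map.
U_img, U_txt = img_emb[n_pairs:], txt_emb[n_pairs:]
loss_teacher = relational_loss(U_img @ T, U_txt)
loss_random = relational_loss(U_img @ rng.normal(size=T.shape), U_txt)
print(f"relational loss -- teacher: {loss_teacher:.4f}, random map: {loss_random:.4f}")
```

In the full method, the relational loss would be an OT divergence minimized by gradient descent over the alignment layers; the diagnostic above only illustrates why preserving intra-modal graph structure is a useful training signal.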
Results & Findings
| Setting | Paired Samples | Metric (e.g., Image‑Text Retrieval Recall@1) | Relative Gain vs. Fully Supervised |
|---|---|---|---|
| CLIP‑ViT / BERT | 5k pairs | 42.3% | +8% |
| BLIP‑ViT / RoBERTa | 10k pairs | 38.7% | +6% |

Results are reported across three datasets: COCO, Flickr30K, and Conceptual Captions.
- Robustness to pair scarcity – Even with as few as 1 k pairs, SOTAlign retains >70% of the performance of a model trained on 5 M pairs.
- Cross‑encoder generalization – The same alignment layers trained on one encoder pair transfer reasonably well to another, indicating that the learned geometry is not tightly coupled to a specific backbone.
- Ablation – Removing the OT refinement drops performance by 10–15 points, confirming that relational transfer from unpaired data is the key driver.
Practical Implications
- Cost‑effective multimodal products – Companies can bootstrap vision‑language features (e.g., image search, caption generation) with only a modest annotation budget, leveraging existing image/video libraries and textual corpora.
- Rapid prototyping – Developers can plug SOTAlign into any pre‑trained vision or language model they already use, obtaining a joint embedding without costly fine‑tuning of the massive backbones.
- Domain adaptation – When moving to a new niche (medical imaging + reports, e‑commerce product photos + descriptions), a handful of domain‑specific pairs plus the abundant in‑domain unpaired data suffice to align the modalities.
- Privacy‑preserving pipelines – Since the heavy encoders stay frozen, only lightweight alignment layers need to be transmitted or updated, reducing the attack surface and enabling on‑device multimodal inference.
Limitations & Future Work
- Reliance on high‑quality unpaired data – The OT refinement assumes that intra‑modal similarity graphs are meaningful; noisy or biased image/text collections can degrade alignment.
- Scalability of OT computation – Although the authors use mini‑batch Sinkhorn approximations, extremely large corpora may still pose runtime challenges.
- Limited to linear alignment layers – More expressive (non‑linear) adapters could capture subtler cross‑modal nuances but were not explored.
- Future directions suggested include: (1) hierarchical OT that respects class‑level semantics, (2) adaptive weighting between teacher and OT losses, and (3) extending the framework to video‑text or audio‑text modalities.
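The mini-batch Sinkhorn approximation mentioned in the scalability point is straightforward to sketch. The snippet below is a minimal entropic-OT solver in numpy with uniform batch weights and a fixed iteration count; it is not the authors' implementation, and for small regularization `eps` a log-domain variant is the safer choice.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iter=200):
    """Entropic-OT transport plan between two uniform mini-batches.
    Plain-domain Sinkhorn iterations; adequate for moderate eps, but a
    log-domain variant avoids underflow when eps is small."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform batch weights
    K = np.exp(-cost / eps)                          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(1)
batch_cost = rng.random((8, 8))  # e.g., pairwise embedding distances in a batch
plan = sinkhorn_plan(batch_cost)
print(plan.sum(axis=1))          # each row marginal converges to ~1/8
```

Restricting OT to mini-batches keeps the per-step cost at O(batch²) rather than quadratic in the full corpus size, which is the trade-off the limitation above refers to.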
Authors
- Simon Roschmann
- Paul Krzakala
- Sonia Mazelet
- Quentin Bouniot
- Zeynep Akata
Paper Information
- arXiv ID: 2602.23353v1
- Categories: cs.LG, cs.AI
- Published: February 26, 2026