[Paper] SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Published: February 26, 2026 at 01:55 PM EST
4 min read
Source: arXiv - 2602.23353v1

Overview

The paper SOTAlign tackles a practical problem: how to fuse powerful, frozen vision and language models into a common embedding space without needing millions of paired image‑text examples. By introducing a semi‑supervised framework that leverages just a handful of paired samples plus abundant unpaired data, the authors demonstrate that high‑quality cross‑modal alignment is possible – a step toward more data‑efficient multimodal AI systems.

Key Contributions

  • Semi‑supervised alignment paradigm – formalizes training with few image‑text pairs plus large pools of unpaired images and texts.
  • Two‑stage SOTAlign pipeline
    1. Coarse geometry recovery using a linear “teacher” network trained on the limited paired set.
    2. Fine‑grained refinement via an optimal‑transport (OT) divergence that transfers relational structure from unpaired data without forcing a strict one‑to‑one mapping.
  • Empirical superiority – outperforms both fully supervised contrastive baselines and prior semi‑supervised methods across multiple vision–language encoder combos and datasets.
  • Modality‑agnostic design – works with any frozen unimodal encoder (e.g., CLIP‑ViT, BLIP‑ViT, BERT, RoBERTa) without re‑training the backbone.

Methodology

  1. Setup – Two frozen encoders, $f_{\text{img}}$ and $f_{\text{txt}}$, map images and texts to high‑dimensional vectors. The goal is to learn lightweight alignment layers $A_{\text{img}}$ and $A_{\text{txt}}$ such that the transformed embeddings lie in a shared space.

  2. Stage 1: Linear Teacher

    • Using the few paired samples $\{(x_i, y_i)\}$, a simple linear mapping $T$ is trained to minimize a contrastive loss.
    • This step captures a rough global alignment (i.e., the overall orientation and scale) and provides a “teacher” distribution over the joint space.
  3. Stage 2: Optimal‑Transport Refinement

    • For the massive unpaired pools $\{x\}$ and $\{y\}$, the method builds pairwise similarity graphs within each modality (e.g., cosine similarity between image embeddings).
    • An OT divergence measures how well the relational structure of the image graph can be transported onto the text graph after alignment.
    • The loss encourages the aligned embeddings to preserve relative distances (i.e., “if two images are similar, their corresponding texts should also be similar”), while allowing flexibility in absolute positioning.
    • The alignment layers are updated via gradient descent on this OT‑based objective, effectively “shaping” the joint space using the abundant unpaired data.
  4. Training Loop – The two stages can be run sequentially or iteratively; the authors report that a single pass (teacher → OT refinement) works best in practice.
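
The two-stage idea above can be sketched end to end. The snippet below is a minimal numpy illustration, not the paper's implementation: synthetic Gaussian features stand in for frozen-encoder embeddings, a least-squares fit stands in for the contrastive teacher loss, and the OT refinement is reduced to checking that the teacher's alignment shrinks the gap between the two intra-modal similarity graphs (the relational structure the OT divergence compares). All names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes: a few paired samples, a large unpaired pool, distinct feature dims.
n_pairs, n_unpaired, d_img, d_txt, d_latent = 32, 200, 64, 48, 16

# Synthetic stand-ins for frozen-encoder outputs: both modalities are linear
# views of a shared latent, so a good alignment exists by construction.
W_img = rng.normal(size=(d_latent, d_img))
W_txt = rng.normal(size=(d_latent, d_txt))
z_pair = rng.normal(size=(n_pairs, d_latent))
z_unp = rng.normal(size=(n_unpaired, d_latent))
img_pair, txt_pair = z_pair @ W_img, z_pair @ W_txt
img_unp, txt_unp = z_unp @ W_img, z_unp @ W_txt

# Stage 1: linear teacher. The paper trains T with a contrastive loss on the
# few pairs; a least-squares fit plays the same role in this toy setting.
T, *_ = np.linalg.lstsq(img_pair, txt_pair, rcond=None)

def cosine_graph(X):
    """Intra-modal pairwise cosine-similarity graph."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

# Stage 2 (illustrated, not optimized): the OT refinement compares exactly
# this kind of relational structure on unpaired data. Here we only check
# that applying the teacher shrinks the gap between the two graphs.
gap_before = np.linalg.norm(cosine_graph(img_unp) - cosine_graph(txt_unp))
gap_after = np.linalg.norm(cosine_graph(img_unp @ T) - cosine_graph(txt_unp))
print(gap_after < gap_before)  # True: aligned images mirror the text graph
```

In the paper, the second stage updates the alignment layers by gradient descent on an OT divergence between such graphs; the point of the sketch is only that the relational gap is a measurable quantity on unpaired data alone.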

Results & Findings

| Setting | Paired Samples | Metric (Image‑Text Retrieval Recall@1) | Relative Gain vs. Fully Supervised |
| --- | --- | --- | --- |
| CLIP‑ViT / BERT | 5 k pairs | 42.3% | +8% |
| BLIP‑ViT / RoBERTa | 10 k pairs | 38.7% | +6% |

Results are reported across 3 datasets (COCO, Flickr30K, Conceptual Captions).
  • Robustness to pair scarcity – Even with as few as 1 k pairs, SOTAlign retains >70% of the performance of a model trained on 5 M pairs.
  • Cross‑encoder generalization – The same alignment layers trained on one encoder pair transfer reasonably well to another, indicating that the learned geometry is not tightly coupled to a specific backbone.
  • Ablation – Removing the OT refinement drops performance by 10–15 points, confirming that relational transfer from unpaired data is the key driver.

Practical Implications

  • Cost‑effective multimodal products – Companies can bootstrap vision‑language features (e.g., image search, caption generation) with only a modest annotation budget, leveraging existing image/video libraries and textual corpora.
  • Rapid prototyping – Developers can plug SOTAlign into any pre‑trained vision or language model they already use, obtaining a joint embedding without costly fine‑tuning of the massive backbones.
  • Domain adaptation – When moving to a new niche (medical imaging + reports, e‑commerce product photos + descriptions), a handful of domain‑specific pairs plus the abundant in‑domain unpaired data suffice to align the modalities.
  • Privacy‑preserving pipelines – Since the heavy encoders stay frozen, only lightweight alignment layers need to be transmitted or updated, reducing the attack surface and enabling on‑device multimodal inference.
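
As a concrete sketch of the "plug in lightweight alignment layers" workflow: the snippet below projects frozen-encoder features through two small alignment matrices (named `A_img` and `A_txt` here, randomly initialized for illustration; in a real deployment they would come from SOTAlign training) and ranks an image gallery against a text query.

```python
import numpy as np

rng = np.random.default_rng(1)
d_img, d_txt, d_joint = 64, 48, 16

# Hypothetical learned alignment matrices: the only trainable (and shippable)
# parameters, since both backbone encoders stay frozen.
A_img = rng.normal(size=(d_img, d_joint)) / np.sqrt(d_img)
A_txt = rng.normal(size=(d_txt, d_joint)) / np.sqrt(d_txt)

def embed(X, A):
    """Project frozen-encoder features into the shared space, L2-normalized."""
    Z = X @ A
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

# Image search: rank a gallery of image embeddings against one text query.
gallery = rng.normal(size=(100, d_img))   # frozen image-encoder outputs
query = rng.normal(size=(1, d_txt))       # frozen text-encoder output
scores = embed(gallery, A_img) @ embed(query, A_txt).T   # cosine similarities
ranking = np.argsort(-scores[:, 0])       # best-matching images first
print(ranking[:5])
```

Because only the two small matrices are applied at inference time, swapping in a new domain (or a retrained alignment) means shipping kilobytes, not the backbone weights.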

Limitations & Future Work

  • Reliance on high‑quality unpaired data – The OT refinement assumes that intra‑modal similarity graphs are meaningful; noisy or biased image/text collections can degrade alignment.
  • Scalability of OT computation – Although the authors use mini‑batch Sinkhorn approximations, extremely large corpora may still pose runtime challenges.
  • Limited to linear alignment layers – More expressive (non‑linear) adapters could capture subtler cross‑modal nuances but were not explored.
  • Future directions suggested include: (1) hierarchical OT that respects class‑level semantics, (2) adaptive weighting between teacher and OT losses, and (3) extending the framework to video‑text or audio‑text modalities.
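
To make the Sinkhorn scalability point concrete, the sketch below runs textbook entropic-OT Sinkhorn iterations on a single mini-batch cosine-cost matrix (synthetic embeddings; the regularization value, batch size, and iteration count are illustrative choices, not the paper's).

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iter=300):
    """Entropic-OT transport plan between uniform mini-batch marginals."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    K = np.exp(-cost / reg)                 # Gibbs kernel
    v = np.ones(m)
    for _ in range(n_iter):                 # alternating scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan

rng = np.random.default_rng(2)
# Mini-batch of aligned image/text embeddings in the shared space (stand-ins).
img_b, txt_b = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
cos = (img_b @ txt_b.T) / (
    np.linalg.norm(img_b, axis=1)[:, None] * np.linalg.norm(txt_b, axis=1)[None, :]
)
plan = sinkhorn(1.0 - cos)
print(np.allclose(plan.sum(axis=0), 1 / 8))  # True: columns hit the marginals
```

The cost per batch is O(n·m) per iteration, which is why mini-batch approximations are needed; on corpus-scale similarity graphs even this becomes the bottleneck the authors flag.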

Authors

  • Simon Roschmann
  • Paul Krzakala
  • Sonia Mazelet
  • Quentin Bouniot
  • Zeynep Akata

Paper Information

  • arXiv ID: 2602.23353v1
  • Categories: cs.LG, cs.AI
  • Published: February 26, 2026