[Paper] OmniRet: Efficient and High-Fidelity Omni Modality Retrieval
Source: arXiv - 2603.02098v1
Overview
OmniRet is the first retrieval system that can understand and search across text, images, and audio all at once. By tackling the twin problems of computational cost and loss of detail when compressing multimodal data, the authors push universal retrieval a step closer to the “search anything with anything” vision.
Key Contributions
- True omni‑modal retrieval: Supports composed queries that combine text, vision, and audio simultaneously.
- Efficient token reduction: Introduces an attention‑based resampling layer that turns long modality‑specific token streams into compact, fixed‑size embeddings, dramatically cutting inference cost.
- Fine‑grained pooling: Proposes Attention Sliced Wasserstein Pooling to retain subtle cross‑modal cues that typical pooling methods discard.
- Large‑scale training: Trains on ~6 M query‑target pairs drawn from 30 public datasets, covering a wide variety of retrieval scenarios.
- New benchmark (ACM): Releases the Audio‑Centric Multimodal Benchmark, adding composed‑audio and audio‑visual retrieval tasks that were missing from prior suites.
Methodology
- Modality encoders – Separate pretrained encoders (e.g., CLIP for vision, Whisper for audio, BERT‑style for text) first turn each input into a sequence of token embeddings.
- Attention‑based resampling – Instead of feeding the full token sequence to a large language model (LLM), a lightweight attention module selects the most informative tokens and aggregates them into a fixed‑size representation (e.g., 256‑dim). This keeps the downstream LLM cheap to run.
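The paper does not ship code, but the resampling step reads like Perceiver-style cross-attention: a small set of learned latent queries attends over the long token stream and emits one vector per query. A minimal NumPy sketch, with the token count (1,500), query count (32), and dimension (256) all assumed for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def resample(tokens, queries):
    """Cross-attention resampling: each learned query attends over the
    full token sequence and returns one aggregated vector."""
    # tokens: (seq_len, dim), queries: (num_queries, dim)
    scores = queries @ tokens.T / np.sqrt(tokens.shape[1])  # (Q, T)
    weights = softmax(scores, axis=-1)                      # rows sum to 1
    return weights @ tokens                                 # (Q, dim)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((1500, 256))   # long modality token stream (assumed)
queries = rng.standard_normal((32, 256))    # learned latent queries (assumed count)
compact = resample(tokens, queries)
print(compact.shape)  # (32, 256): 1500 tokens compressed to 32
```

The key design point is that the output size is fixed by the number of queries, not the input length, so the downstream LLM sees the same short sequence regardless of how long the raw audio or image token stream is.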
- Attention Sliced Wasserstein Pooling (ASWP) – The compact vectors from each modality are pooled with a Wasserstein‑distance‑inspired mechanism that encourages the final embedding to preserve the distributional characteristics of the original token set. In practice, ASWP acts like a content‑aware weighted average that keeps fine‑grained patterns (e.g., a specific bird chirp or a subtle visual texture).
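OmniRet's ASWP adds attention weighting on top of sliced Wasserstein machinery; the sketch below shows only the underlying sliced Wasserstein distance (project both token sets onto random 1‑D directions, sort, and compare), which is the distributional comparison the pooling builds on. Set sizes and projection count are illustrative, not from the paper:

```python
import numpy as np

def sliced_wasserstein(a, b, n_proj=64, seed=0):
    """Average 1-D Wasserstein-2 distance over random unit projections.
    a, b: (n, d) token sets with the same n."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_proj, a.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    pa = np.sort(a @ dirs.T, axis=0)  # sorted 1-D projections, (n, n_proj)
    pb = np.sort(b @ dirs.T, axis=0)
    return np.sqrt(np.mean((pa - pb) ** 2))

rng = np.random.default_rng(1)
x = rng.standard_normal((100, 16))
print(sliced_wasserstein(x, x))        # identical sets: distance 0
print(sliced_wasserstein(x, x + 5.0))  # shifted set: large distance
```

Because sorting compares whole empirical distributions rather than just means, an objective built on this distance can penalize a pooled embedding that averages away rare-but-distinctive tokens, which is the failure mode of plain mean pooling.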
- Joint training – All components are trained end‑to‑end on a contrastive loss that pulls matching query‑target pairs together while pushing non‑matches apart. The massive, heterogeneous training set forces the model to learn a universal embedding space.
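The pull‑together/push‑apart objective described here is the standard contrastive setup; a minimal NumPy sketch of an InfoNCE‑style loss, with the batch size, embedding dimension, and temperature all assumed:

```python
import numpy as np

def info_nce(q, t, temperature=0.07):
    """InfoNCE contrastive loss over a batch of query/target embeddings.
    Matching pairs share a row index; every other row is a negative."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = q @ t.T / temperature                 # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # maximize diagonal matches

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 32))
aligned = info_nce(q, q)                         # perfectly matched pairs
shuffled = info_nce(q, rng.standard_normal((8, 32)))  # unrelated targets
print(aligned, shuffled)  # matched pairs yield the lower loss
```

Training on 30 heterogeneous datasets with one shared loss like this is what forces all modalities into a single embedding space: a spoken phrase and a photo of the same scene must land near each other, or the loss stays high.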
Results & Findings
| Task family | OmniRet vs. SOTA | Notable gain |
|---|---|---|
| Composed text‑vision‑audio queries | +12 % Recall@10 | Handles “a dog barking in a park” style queries |
| Pure audio retrieval | +9 % Recall@5 | Better capture of temporal cues |
| Video retrieval (audio‑visual) | +7 % Recall@10 | Leverages both sound and frames |
| Standard text‑image retrieval | On‑par (±0.3 % Recall) | No regression despite extra capacity |
The new ACM benchmark confirms that OmniRet uniquely solves the previously unsupported composed audio and audio‑visual retrieval tasks, achieving the highest scores among all baselines.
Practical Implications
- Search engines & digital assistants: Developers can build “search by example” features where a user drops a photo, speaks a phrase, and types additional constraints—all in one query.
- Content recommendation: Platforms (e.g., podcasts, video streaming) can match user‑generated multimodal snippets to catalog items, improving discoverability.
- Asset management: Media teams can locate assets by mixing modalities (e.g., “find the clip where a siren sounds while a red car passes”).
- Reduced infrastructure cost: The attention‑based resampling cuts token length by 70‑90 %, meaning existing LLM‑backed pipelines can adopt OmniRet without massive GPU upgrades.
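Because self‑attention cost grows roughly quadratically with sequence length, the savings compound beyond the raw token cut. A back‑of‑envelope sketch using an assumed 1,500‑token stream and the mid‑range 80 % reduction:

```python
# Self-attention FLOPs scale ~quadratically with sequence length, so an
# 80% token cut (mid-range of the reported 70-90%) shrinks the attention
# cost far more than 80%.
full_tokens = 1500                 # assumed raw multimodal token stream
kept = int(full_tokens * 0.2)      # 80% reduction -> 300 tokens
attn_ratio = (kept / full_tokens) ** 2
print(f"{kept} tokens, attention cost ~{attn_ratio:.0%} of original")
# 300 tokens, attention cost ~4% of original
```

This quadratic effect is why the authors can claim existing LLM‑backed pipelines fit on current hardware: a 5x shorter sequence means roughly a 25x cheaper attention pass.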
Limitations & Future Work
- Scalability of training data: While 6 M pairs is a large corpus, the model still struggles in niche domains (e.g., medical imaging + auscultation audio) where data is scarce.
- Latency on edge devices: The resampling step is lightweight, but the full encoder stack (vision + audio + LLM) may still be too heavy for on‑device inference without further quantization.
- Modalities beyond the triad: The current design assumes three modalities; extending to haptics, 3‑D point clouds, or sensor streams will require architectural tweaks.
- Interpretability: The attention maps used for resampling are not yet exposed to end‑users; future work could surface “why this result was retrieved” to aid debugging.
OmniRet opens the door to truly universal retrieval systems, and its efficient design makes it a realistic candidate for integration into next‑generation search and recommendation platforms. Keep an eye on the upcoming ACM benchmark releases—they’ll likely become the new standard for measuring omni‑modal understanding.
Authors
- Chuong Huynh
- Manh Luong
- Abhinav Shrivastava
Paper Information
- arXiv ID: 2603.02098v1
- Categories: cs.IR, cs.CL, cs.CV
- Published: March 2, 2026