[Paper] OmniRet: Efficient and High-Fidelity Omni Modality Retrieval
Source: arXiv - 2603.02098v1
Overview
OmniRet is the first retrieval system that can understand and search across text, images, and audio all at once. By tackling the twin problems of computational cost and loss of detail when compressing multimodal data, the authors push universal retrieval a step closer to the “search anything with anything” vision.
Key Contributions
- True omni‑modal retrieval: Supports composed queries that combine text, vision, and audio simultaneously.
- Efficient token reduction: Introduces an attention‑based resampling layer that turns long modality‑specific token streams into compact, fixed‑size embeddings, dramatically cutting inference cost.
- Fine‑grained pooling: Proposes Attention Sliced Wasserstein Pooling to retain subtle cross‑modal cues that typical pooling methods discard.
- Large‑scale training: Trains on ~6 M query‑target pairs drawn from 30 public datasets, covering a wide variety of retrieval scenarios.
- New benchmark (ACM): Releases the Audio‑Centric Multimodal Benchmark, adding composed‑audio and audio‑visual retrieval tasks that were missing from prior suites.
Methodology
- Modality encoders – Separate pretrained encoders (e.g., CLIP for vision, Whisper for audio, BERT‑style for text) first turn each input into a sequence of token embeddings.
- Attention‑based resampling – Instead of feeding the full token sequence to a large language model (LLM), a lightweight attention module selects the most informative tokens and aggregates them into a fixed‑size representation (e.g., 256‑dim). This keeps the downstream LLM cheap to run.
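The paper does not ship code, but the resampling step reads like Perceiver-style cross-attention: a small set of learned latent queries attends over the long token stream and emits one vector per query. A minimal NumPy sketch, with the token count (1,500), query count (32), and dimension (256) all assumed for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def resample(tokens, queries):
    """Cross-attention resampling: each learned query attends over the
    full token sequence and returns one aggregated vector."""
    # tokens: (seq_len, dim), queries: (num_queries, dim)
    scores = queries @ tokens.T / np.sqrt(tokens.shape[1])  # (Q, T)
    weights = softmax(scores, axis=-1)                      # rows sum to 1
    return weights @ tokens                                 # (Q, dim)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((1500, 256))   # long modality token stream (assumed)
queries = rng.standard_normal((32, 256))    # learned latent queries (assumed count)
compact = resample(tokens, queries)
print(compact.shape)  # (32, 256): 1500 tokens compressed to 32
```

The key design point is that the output size is fixed by the number of queries, not the input length, so the downstream LLM sees the same short sequence regardless of how long the raw audio or image token stream is.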
- Attention Sliced Wasserstein Pooling (ASWP) – The compact vectors from each modality are pooled with a Wasserstein‑distance‑inspired mechanism that encourages the final embedding to preserve the distributional characteristics of the original token set. In practice, ASWP acts like a content‑aware weighted average that keeps fine‑grained patterns (e.g., a specific bird chirp or a subtle visual texture).
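OmniRet's ASWP adds attention weighting on top of sliced Wasserstein machinery; the sketch below shows only the underlying sliced Wasserstein distance (project both token sets onto random 1‑D directions, sort, and compare), which is the distributional comparison the pooling builds on. Set sizes and projection count are illustrative, not from the paper:

```python
import numpy as np

def sliced_wasserstein(a, b, n_proj=64, seed=0):
    """Average 1-D Wasserstein-2 distance over random unit projections.
    a, b: (n, d) token sets with the same n."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_proj, a.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    pa = np.sort(a @ dirs.T, axis=0)  # sorted 1-D projections, (n, n_proj)
    pb = np.sort(b @ dirs.T, axis=0)
    return np.sqrt(np.mean((pa - pb) ** 2))

rng = np.random.default_rng(1)
x = rng.standard_normal((100, 16))
print(sliced_wasserstein(x, x))        # identical sets: distance 0
print(sliced_wasserstein(x, x + 5.0))  # shifted set: large distance
```

Because sorting compares whole empirical distributions rather than just means, an objective built on this distance can penalize a pooled embedding that averages away rare-but-distinctive tokens, which is the failure mode of plain mean pooling.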
- Joint training – All components are trained end‑to‑end on a contrastive loss that pulls matching query‑target pairs together while pushing non‑matches apart. The massive, heterogeneous training set forces the model to learn a universal embedding space.
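The pull‑together/push‑apart objective described here is the standard contrastive setup; a minimal NumPy sketch of an InfoNCE‑style loss, with the batch size, embedding dimension, and temperature all assumed:

```python
import numpy as np

def info_nce(q, t, temperature=0.07):
    """InfoNCE contrastive loss over a batch of query/target embeddings.
    Matching pairs share a row index; every other row is a negative."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = q @ t.T / temperature                 # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # maximize diagonal matches

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 32))
aligned = info_nce(q, q)                         # perfectly matched pairs
shuffled = info_nce(q, rng.standard_normal((8, 32)))  # unrelated targets
print(aligned, shuffled)  # matched pairs yield the lower loss
```

Training on 30 heterogeneous datasets with one shared loss like this is what forces all modalities into a single embedding space: a spoken phrase and a photo of the same scene must land near each other, or the loss stays high.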
Results & Findings
| Task family | OmniRet vs. SOTA | Notable gain |
|---|---|---|
| Composed text‑vision‑audio queries | +12 % Recall@10 | Handles “a dog barking in a park” style queries |
| Pure audio retrieval | +9 % Recall@5 | Better capture of temporal cues |
| Video retrieval (audio‑visual) | +7 % Recall@10 | Leverages both sound and frames |
| Standard text‑image retrieval | On‑par (±0.3 % Recall) | No regression despite extra capacity |
The new ACM benchmark confirms that OmniRet uniquely solves the previously unsupported composed audio and audio‑visual retrieval tasks, achieving the highest scores among all baselines.
Practical Implications
- Search engines & digital assistants: Developers can build “search by example” features where a user drops a photo, speaks a phrase, and types additional constraints—all in one query.
- Content recommendation: Platforms (e.g., podcasts, video streaming) can match user‑generated multimodal snippets to catalog items, improving discoverability.
- Asset management: Media teams can locate assets by mixing modalities (e.g., “find the clip where a siren sounds while a red car passes”).
- Reduced infrastructure cost: The attention‑based resampling cuts token length by 70‑90 %, meaning existing LLM‑backed pipelines can adopt OmniRet without massive GPU upgrades.
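Because self‑attention cost grows roughly quadratically with sequence length, the savings compound beyond the raw token cut. A back‑of‑envelope sketch using an assumed 1,500‑token stream and the mid‑range 80 % reduction:

```python
# Self-attention FLOPs scale ~quadratically with sequence length, so an
# 80% token cut (mid-range of the reported 70-90%) shrinks the attention
# cost far more than 80%.
full_tokens = 1500                 # assumed raw multimodal token stream
kept = int(full_tokens * 0.2)      # 80% reduction -> 300 tokens
attn_ratio = (kept / full_tokens) ** 2
print(f"{kept} tokens, attention cost ~{attn_ratio:.0%} of original")
# 300 tokens, attention cost ~4% of original
```

This quadratic effect is why the authors can claim existing LLM‑backed pipelines fit on current hardware: a 5x shorter sequence means roughly a 25x cheaper attention pass.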
Limitations & Future Work
- Scalability of training data: While 6 M pairs is a large corpus, the model still struggles in niche domains (e.g., medical imaging + auscultation audio) where data is scarce.
- Latency on edge devices: The resampling step is lightweight, but the full encoder stack (vision + audio + LLM) may still be too heavy for on‑device inference without further quantization.
- Modalities beyond the triad: The current design assumes three modalities; extending to haptics, 3‑D point clouds, or sensor streams will require architectural tweaks.
- Interpretability: The attention maps used for resampling are not yet exposed to end‑users; future work could surface “why this result was retrieved” to aid debugging.
OmniRet opens the door to truly universal retrieval systems, and its efficient design makes it a realistic candidate for integration into next‑generation search and recommendation platforms. Keep an eye on the upcoming ACM benchmark releases—they’ll likely become the new standard for measuring omni‑modal understanding.
Authors
- Chuong Huynh
- Manh Luong
- Abhinav Shrivastava
Paper Information
- arXiv ID: 2603.02098v1
- Categories: cs.IR, cs.CL, cs.CV
- Published: March 2, 2026