[Paper] Multi-Vector Index Compression in Any Modality

Published: February 24, 2026 at 01:57 PM EST
5 min read
Source: arXiv - 2602.21202v1

Overview

The paper “Multi-Vector Index Compression in Any Modality” tackles a pressing bottleneck in modern retrieval systems that rely on late interaction—a technique that compares query and document vectors token‑by‑token to achieve high accuracy across text, images, and video. While powerful, late interaction scales linearly with document length, making storage and latency prohibitive for media‑rich collections. The authors propose a suite of query‑agnostic compression strategies that shrink multi‑vector document indexes to a fixed budget without sacrificing retrieval quality.

Key Contributions

  • Four compression paradigms for multi‑vector indexes:
    1. Sequence resizing – uniformly truncates or pads token sequences.
    2. Memory tokens – learns a small set of “memory” vectors that summarize a document.
    3. Hierarchical pooling – builds a non‑parametric tree of pooled vectors.
    4. Attention‑Guided Clustering (AGC) – a novel, learnable clustering that selects semantically salient token centroids using attention scores.
  • Unified evaluation across heterogeneous retrieval benchmarks (BEIR for text, ViDoRe for visual documents, MSR‑VTT & MultiVENT 2.0 for video).
  • Demonstrated that AGC consistently outperforms other compression methods and can match or exceed the performance of an uncompressed index while using far fewer vectors.
  • Open‑source implementation released (github.com/hanxiangqin/omni-col-press), enabling immediate experimentation.

Methodology

Late‑interaction models (e.g., ColBERT and its multimodal extensions) store a set of token‑level embeddings for each document. Retrieval computes a dot‑product between every query token and every document token, which is costly when documents contain hundreds of tokens (think video frames or high‑resolution images).
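
This token‑by‑token ("MaxSim") scoring can be sketched in a few lines of NumPy; the function and variable names here are illustrative, not taken from the paper's codebase:

```python
import numpy as np

def late_interaction_score(Q, D):
    """ColBERT-style MaxSim: for each query token, take the maximum
    dot-product against all document tokens, then sum over query tokens.
    Q: (Lq, d) query token embeddings; D: (Ld, d) document token embeddings.
    """
    sims = Q @ D.T                 # (Lq, Ld) all token-to-token dot products
    return sims.max(axis=1).sum()  # best document match per query token

# Cost scales with Lq * Ld: a 3-token query against a 500-token
# document already needs 1500 dot products per candidate document.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
D = rng.standard_normal((500, 8))
score = late_interaction_score(Q, D)
```

The `Ld` factor is exactly what compression attacks: shrinking the document side from `Ld` to a fixed `K` cuts both storage and per-candidate scoring cost proportionally.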

The authors treat compression as a query‑agnostic mapping: given a document’s full token matrix X ∈ ℝ^{L×d} (L = token count, d = embedding dim), produce a compact representation C ∈ ℝ^{K×d} where K ≪ L and is fixed across the corpus.
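
The simplest possible instance of such a mapping is plain truncation with padding, sketched below (`resize_sequence` is an illustrative name, not the paper's API):

```python
import numpy as np

def resize_sequence(X, K):
    """Simplest query-agnostic compressor: keep the first K token
    vectors, or zero-pad when the document is shorter than K.
    X: (L, d) token matrix -> C: (K, d), with K fixed across the corpus.
    """
    L, d = X.shape
    if L >= K:
        return X[:K]               # truncate long documents
    return np.vstack([X, np.zeros((K - L, d))])  # pad short ones

C_long = resize_sequence(np.ones((10, 4)), K=6)   # shape (6, 4), truncated
C_short = resize_sequence(np.ones((3, 4)), K=6)   # shape (6, 4), zero-padded
```

Every strategy below produces the same `(K, d)` output shape; they differ only in how intelligently they choose or synthesize the K vectors.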

  1. Sequence resizing simply selects the first K tokens (or pads if L < K).
  2. Memory tokens learn K global vectors that are updated via back‑propagation to best reconstruct the original token set.
  3. Hierarchical pooling recursively pools neighboring tokens (e.g., average‑pool then max‑pool) to build a tree; leaf nodes at a chosen depth become the compressed set.
  4. Attention‑Guided Clustering (AGC):
    • Compute attention scores for each token using a lightweight query‑independent attention head.
    • Use these scores as importance weights in a differentiable clustering loss (similar to soft K‑means).
    • The resulting centroids become the compressed vectors, and the attention weights guide how much each original token contributes to its centroid.
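
The AGC steps above can be approximated as importance-weighted soft k-means. The sketch below is a hypothetical stand-in: the paper learns the attention head end-to-end, whereas here a fixed norm-based score substitutes for the learned attention, purely for illustration:

```python
import numpy as np

def attention_guided_clustering(X, K, iters=10, tau=1.0, seed=0):
    """Hypothetical AGC sketch: importance-weighted soft k-means.
    In the paper, a lightweight query-independent attention head produces
    per-token scores; here a fixed norm-based score stands in for it.
    X: (L, d) token matrix -> (K, d) centroids used as the compressed index.
    """
    rng = np.random.default_rng(seed)
    # Stand-in for learned attention: softmax over per-token scores.
    logits = np.linalg.norm(X, axis=1)
    w = np.exp(logits - logits.max())
    w /= w.sum()                                   # (L,) importance weights
    C = X[rng.choice(len(X), K, replace=False)]    # init centroids from tokens
    for _ in range(iters):
        # Soft assignment of each token to each centroid (stable softmax).
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)        # (L, K)
        A = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / tau)
        A /= A.sum(axis=1, keepdims=True)
        # Attention-weighted update: salient tokens pull centroids harder.
        W = A * w[:, None]                                          # (L, K)
        C = (W.T @ X) / (W.sum(axis=0)[:, None] + 1e-9)
    return C

tokens = np.random.default_rng(1).standard_normal((50, 8))
centroids = attention_guided_clustering(tokens, K=4)  # (4, 8) compressed index
```

In the real method the assignment and the attention scores are differentiable, so gradients from the retrieval loss flow back through the clustering and shape which tokens count as salient.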

During training, the compression module is jointly optimized with the downstream retrieval loss, ensuring that the compressed index remains highly discriminative for the final similarity scoring.

Results & Findings

| Benchmark | Full Index (baseline) | Best Compression (AGC) | Gap to Baseline |
| --- | --- | --- | --- |
| BEIR (text) | nDCG@10 = 0.543 | nDCG@10 = 0.537 (K=64) | −1.1 % |
| ViDoRe (visual docs) | Recall@10 = 0.712 | Recall@10 = 0.704 (K=48) | −1.1 % |
| MSR‑VTT (video) | Recall@5 = 0.381 | Recall@5 = 0.376 (K=32) | −1.3 % |
| MultiVENT 2.0 (video) | mAP = 0.462 | mAP = 0.459 (K=32) | −0.6 % |

Key takeaways

  • AGC consistently beats sequence resizing and memory tokens across all modalities, often by a margin of 2–5 % absolute when the compression ratio is aggressive (K ≈ 30 % of original tokens).
  • Hierarchical pooling offers flexibility (any K can be chosen post‑hoc) but lags behind AGC because it lacks learned semantic weighting.
  • Performance degradation is modest even when reducing the index to a fraction of its original size, demonstrating that much of the token‑level information is redundant.

Practical Implications

  • Scalable Search Services – Cloud providers can store multi‑vector indexes for billions of images or video clips using a fraction of the memory, cutting infrastructure costs dramatically.
  • Edge Deployment – Mobile or IoT devices can embed a compressed index locally (e.g., for on‑device image search) without exhausting limited storage or compute budgets.
  • Faster Retrieval – Fewer token‑to‑token comparisons translate directly into lower latency, enabling real‑time multimodal search in interactive applications (e.g., visual product recommendation, video clip retrieval).
  • Unified Pipeline – Because the compression is modality‑agnostic, a single retrieval backend can handle text, images, and video uniformly, simplifying system architecture for platforms that index mixed‑media content.
  • Open‑source Toolkit – The released codebase includes ready‑to‑use PyTorch modules and scripts for integrating AGC into existing late‑interaction models, lowering the barrier for developers to experiment.

Limitations & Future Work

  • Query‑agnostic compression means the index cannot adapt to specific query distributions; future work could explore hybrid schemes that add lightweight query‑dependent refinements.
  • Training overhead – Jointly learning the compression module adds extra epochs and memory during model fine‑tuning, which may be prohibitive for very large corpora.
  • Fixed‑size budget – While convenient, a static K may be suboptimal for documents with highly variable semantic density (e.g., a short caption vs. a long documentary). Adaptive budgeting strategies are an open direction.
  • Evaluation scope – The paper focuses on retrieval metrics; downstream tasks such as reranking, relevance feedback, or cross‑modal generation were not examined. Extending compression to those scenarios could broaden impact.

Overall, the study provides a practical roadmap for shrinking multimodal retrieval indexes without sacrificing the high accuracy that late interaction models are known for—an advance that could make next‑generation search systems both smarter and cheaper.

Authors

  • Hanxiang Qin
  • Alexander Martin
  • Rohan Jha
  • Chunsheng Zuo
  • Reno Kriz
  • Benjamin Van Durme

Paper Information

  • arXiv ID: 2602.21202v1
  • Categories: cs.IR, cs.CL, cs.CV
  • Published: February 24, 2026