[Paper] V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval
Source: arXiv - 2602.06034v1
Overview
The paper introduces V‑Retrver, a new framework that turns multimodal retrieval (searching for images, videos, or other media based on a textual query) into an agentic reasoning process. Instead of relying solely on pre‑computed visual embeddings, V‑Retrver lets a multimodal large language model (MLLM) actively request visual evidence from external tools, verify its hypotheses, and iteratively improve its ranking decisions. The result is a system that reasons more reliably on ambiguous visual content and boosts retrieval performance across several benchmarks.
Key Contributions
- Evidence‑driven retrieval paradigm – reframes retrieval as a loop of hypothesis generation → targeted visual inspection → hypothesis refinement.
- Agentic MLLM – equips the language model with the ability to invoke external visual tools (e.g., object detectors, OCR, region proposal networks) on‑the‑fly during reasoning.
- Curriculum‑based training pipeline – combines supervised “reasoning activation” data, a rejection‑based refinement stage, and reinforcement learning with an evidence‑alignment loss to teach the model when and how to request visual evidence.
- Strong empirical gains – achieves an average 23 % relative improvement in Recall@1 over strong baselines on multiple multimodal retrieval benchmarks.
- Demonstrated generalization – the same trained agent works across diverse domains (image‑text, video‑text, and cross‑modal retrieval) without task‑specific fine‑tuning.
Methodology
Problem Formulation
- Traditional multimodal retrieval pipelines encode each candidate image/video into a static vector and rank them with a similarity score.
- V‑Retrver treats each candidate as a potential evidence source and lets the MLLM decide whether more visual information is needed.
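The contrast above can be sketched in a few lines: a static pipeline scores every candidate once and sorts, while an evidence-driven system first checks whether the embedding scores alone are decisive. This is a minimal illustration, not the paper's implementation; the toy vectors and the top-2 margin threshold are my assumptions.

```python
# Sketch: static-embedding ranking vs. an evidence-gated decision.
# Vectors and the margin threshold are illustrative, not from the paper.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def static_rank(query_vec, candidate_vecs):
    """Classic pipeline: score every candidate once and sort."""
    scores = [(i, cosine(query_vec, c)) for i, c in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)

def needs_evidence(ranking, margin=0.05):
    """Evidence gate: if the top two scores are nearly tied, embeddings
    alone are ambiguous and visual evidence should be requested."""
    if len(ranking) < 2:
        return False
    return ranking[0][1] - ranking[1][1] < margin

query = [1.0, 0.2, 0.0]
candidates = [[0.9, 0.3, 0.1], [0.95, 0.25, 0.05], [0.0, 1.0, 0.0]]
ranking = static_rank(query, candidates)
print(needs_evidence(ranking))  # near-tied top-2 triggers an evidence request
```

The gate is the key design point: evidence gathering is spent only where the static score is indecisive, which is what keeps the agentic loop affordable.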
Agentic Reasoning Loop
- Hypothesis Generation – The MLLM reads the query and produces an initial ranking hypothesis (e.g., “the answer likely contains a red car”).
- Evidence Request – If the hypothesis is uncertain, the model issues a tool call such as “detect objects of type ‘car’ in image #3” or “run OCR on region (120,200,300,350)”.
- Tool Execution – An external visual module processes the request and returns concrete evidence (object labels, bounding boxes, text snippets).
- Verification & Refinement – The MLLM incorporates the evidence, revises its confidence scores, and may issue further requests until a stopping criterion is met.
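The four steps above can be sketched as a loop over stub tools. The confidence updates, the fake detector, and the stopping constants are illustrative stand-ins for the paper's MLLM-driven components.

```python
# Minimal sketch of the hypothesis -> evidence -> verification loop.
# The stub detector and update rules are assumptions, not the paper's method.

CONF_THRESHOLD = 0.9   # stopping criterion: one candidate is convincing
MAX_STEPS = 3          # hard cap on evidence-gathering rounds

def detect_objects(image_id):
    """Stub visual tool; a real system would invoke an object detector."""
    fake_index = {3: ["car", "tree"], 7: ["person"]}
    return fake_index.get(image_id, [])

def agentic_rerank(query_terms, candidate_ids):
    # Hypothesis generation: start every candidate at a neutral confidence.
    confidences = {cid: 0.5 for cid in candidate_ids}
    for _ in range(MAX_STEPS):
        if max(confidences.values()) >= CONF_THRESHOLD:
            break  # verification succeeded; stop requesting evidence
        for cid in candidate_ids:
            # Evidence request + tool execution.
            labels = detect_objects(cid)
            # Verification & refinement: reward matching evidence,
            # penalize contradictory evidence.
            if any(t in labels for t in query_terms):
                confidences[cid] = min(1.0, confidences[cid] + 0.5)
            else:
                confidences[cid] = max(0.0, confidences[cid] - 0.2)
    return sorted(candidate_ids, key=lambda c: confidences[c], reverse=True)

print(agentic_rerank(["car"], [3, 7]))  # → [3, 7]
```

In the real system the MLLM decides which tool to call and how to interpret the returned evidence; here both are hard-coded to keep the control flow visible.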
Training Strategy
- Curriculum Learning – Starts with supervised examples where the correct evidence‑request sequence is provided, then gradually introduces harder cases requiring rejection‑based refinement.
- Rejection‑Based Refinement – The model learns to discard wrong hypotheses after seeing contradictory evidence, mimicking human “try‑and‑discard” reasoning.
- Reinforcement Learning (RL) – An evidence‑aligned reward encourages the model to request just enough evidence to reach the correct answer, penalizing unnecessary tool calls.
- Evidence‑Aligned Objective – The loss combines standard retrieval ranking loss with a term that measures how well the gathered evidence matches the ground‑truth visual cues.
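A reward with the shape described above can be sketched as follows. The weights and the Jaccard overlap used to measure evidence alignment are my assumptions; the paper's exact objective is not reproduced here.

```python
# Hedged sketch of an evidence-aligned RL reward: task correctness, plus an
# alignment bonus for evidence matching ground-truth cues, minus a per-call
# cost that discourages unnecessary tool use. Weights are illustrative.

def evidence_reward(correct, gathered, gold_cues, n_tool_calls,
                    w_align=0.5, w_cost=0.1):
    task = 1.0 if correct else 0.0
    # Alignment term: Jaccard overlap between gathered evidence and
    # ground-truth visual cues (an assumption for illustration).
    union = gathered | gold_cues
    overlap = len(gathered & gold_cues) / len(union) if union else 0.0
    return task + w_align * overlap - w_cost * n_tool_calls

# Correct answer, half the gold cues gathered, two tool calls:
print(evidence_reward(True, {"car", "red"}, {"car", "logo"}, 2))
```

The cost term is what produces the "just enough evidence" behavior: an extra tool call must buy enough alignment or correctness to pay for itself.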
Implementation Details
- Base MLLM: LLaMA‑2‑7B fine‑tuned with multimodal adapters.
- Visual Tools: Pre‑trained DETR for object detection, Tesseract OCR, CLIP‑based region embeddings, and a lightweight video frame sampler.
- Inference Overhead: roughly 1.8× slower than a static encoder; the cost stays bounded because evidence requests are issued only for the top‑k candidates.
Results & Findings
| Benchmark | Baseline (static encoder) | V‑Retrver | Relative Δ (↑) |
|---|---|---|---|
| MSCOCO Image‑Text Retrieval | 38.2 % R@1 | 46.9 % R@1 | +23 % |
| Flickr30K | 41.5 % R@1 | 50.8 % R@1 | +22 % |
| TV‑QA Video‑Text Retrieval | 29.3 % R@1 | 36.7 % R@1 | +25 % |
| WebVision (noisy web images) | 31.0 % R@1 | 38.5 % R@1 | +24 % |
- Reliability: In cases with visually ambiguous queries (e.g., “a person holding a small object”), V‑Retrver’s evidence‑driven verification reduced hallucinations by ~40 % compared to pure‑language CoT methods.
- Generalization: Without any dataset‑specific fine‑tuning, the same agent achieved comparable gains on both image‑ and video‑based retrieval tasks, indicating the approach is not tied to a particular modality.
- Efficiency Trade‑off: The average number of tool calls per query was 2.3, striking a balance between performance boost and computational cost.
Practical Implications
- Better Search Engines – Integrating V‑Retrver‑style agents into image or video search platforms can improve relevance, especially for queries that hinge on fine‑grained visual details (e.g., “red sports car with a visible license plate”).
- Content Moderation – The ability to request targeted evidence (e.g., “detect nudity in region X”) can make automated moderation more precise and explainable.
- E‑Commerce – Product search can benefit from on‑demand verification (“show items with a visible brand logo”) without pre‑computing exhaustive attribute embeddings for every catalog item.
- Developer Toolkits – The framework is modular—any off‑the‑shelf visual model can be wrapped as a tool, allowing developers to plug in domain‑specific detectors (medical imaging, satellite imagery) and let the LLM orchestrate them.
- Explainability – Because the reasoning trace includes explicit evidence requests and tool outputs, developers can surface a “why this result?” view to end‑users, increasing trust.
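The "any off‑the‑shelf visual model can be wrapped as a tool" point above can be sketched as a small registry that maps tool names to callables, which the agent dispatches against. The tool names, signatures, and stub OCR function are illustrative, not part of the paper.

```python
# Sketch of a modular tool registry: any function becomes a tool the
# agent can invoke by name. All names and signatures here are assumptions.
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., object]] = {}

def register_tool(name: str):
    """Decorator that exposes any function as a named tool."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("ocr")
def run_ocr(image_id: int, region: tuple) -> str:
    # Stand-in for a real OCR engine such as Tesseract.
    return f"text-from-{image_id}-{region}"

def dispatch(call: dict):
    """Execute a tool call emitted by the agent (e.g. a parsed JSON action)."""
    return TOOLS[call["tool"]](*call["args"])

print(dispatch({"tool": "ocr", "args": (3, (120, 200, 300, 350))}))
```

Because dispatch is name-based, swapping in a domain-specific detector (medical imaging, satellite imagery) is a one-decorator change, and the recorded `call` dicts double as the explainable reasoning trace.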
Limitations & Future Work
- Latency – The interactive evidence‑gathering loop adds inference time, which may be prohibitive for real‑time applications without further optimization (e.g., caching frequent tool results).
- Tool Dependency – The quality of retrieved evidence is bounded by the underlying visual modules; poor detectors can mislead the reasoning process.
- Scalability to Large Corpora – Current experiments evaluate top‑k candidate reranking; extending the approach to full‑scale retrieval (millions of items) will require efficient candidate pruning strategies.
- Learning from Noisy Evidence – Future work could explore robust RL objectives that tolerate imperfect tool outputs, and investigate self‑supervised curricula that automatically generate evidence‑request sequences.
V‑Retrver opens a promising direction where language models become active agents that “look” at the world when needed, turning static retrieval pipelines into dynamic, evidence‑grounded systems.
Authors
- Dongyang Chen
- Chaoyang Wang
- Dezhao Su
- Xi Xiao
- Zeyu Zhang
- Jing Xiong
- Qing Li
- Yuzhang Shang
- Shichao Ka
Paper Information
- arXiv ID: 2602.06034v1
- Categories: cs.CV
- Published: February 5, 2026