[Paper] V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval
Source: arXiv - 2602.06034v1
Overview
The paper introduces V‑Retrver, a new framework that turns multimodal retrieval (searching for images, videos, or other media based on a textual query) into an agentic reasoning process. Instead of relying solely on pre‑computed visual embeddings, V‑Retrver lets a multimodal large language model (MLLM) actively request visual evidence from external tools, verify its hypotheses, and iteratively improve its ranking decisions. The result is a system that reasons more reliably on ambiguous visual content and boosts retrieval performance across several benchmarks.
Key Contributions
- Evidence‑driven retrieval paradigm – reframes retrieval as a loop of hypothesis generation → targeted visual inspection → hypothesis refinement.
- Agentic MLLM – equips the language model with the ability to invoke external visual tools (e.g., object detectors, OCR, region proposal networks) on‑the‑fly during reasoning.
- Curriculum‑based training pipeline – combines supervised “reasoning activation” data, a rejection‑based refinement stage, and reinforcement learning with an evidence‑alignment loss to teach the model when and how to request visual evidence.
- Strong empirical gains – achieves an average 23 % relative improvement in Recall@1 over strong baselines on multiple multimodal retrieval benchmarks.
- Demonstrated generalization – the same trained agent works across diverse domains (image‑text, video‑text, and cross‑modal retrieval) without task‑specific fine‑tuning.
Methodology
Problem Formulation
- Traditional multimodal retrieval pipelines encode each candidate image/video into a static vector and rank them with a similarity score.
- V‑Retrver treats each candidate as a potential evidence source and lets the MLLM decide whether more visual information is needed.
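The contrast above can be sketched in a few lines: a static pipeline scores every candidate once and sorts, while an evidence-driven system first checks whether the embedding scores alone are decisive. This is a minimal illustration, not the paper's implementation; the toy vectors and the top-2 margin threshold are my assumptions.

```python
# Sketch: static-embedding ranking vs. an evidence-gated decision.
# Vectors and the margin threshold are illustrative, not from the paper.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def static_rank(query_vec, candidate_vecs):
    """Classic pipeline: score every candidate once and sort."""
    scores = [(i, cosine(query_vec, c)) for i, c in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)

def needs_evidence(ranking, margin=0.05):
    """Evidence gate: if the top two scores are nearly tied, embeddings
    alone are ambiguous and visual evidence should be requested."""
    if len(ranking) < 2:
        return False
    return ranking[0][1] - ranking[1][1] < margin

query = [1.0, 0.2, 0.0]
candidates = [[0.9, 0.3, 0.1], [0.95, 0.25, 0.05], [0.0, 1.0, 0.0]]
ranking = static_rank(query, candidates)
print(needs_evidence(ranking))  # near-tied top-2 triggers an evidence request
```

The gate is the key design point: evidence gathering is spent only where the static score is indecisive, which is what keeps the agentic loop affordable.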
Agentic Reasoning Loop
- Hypothesis Generation – The MLLM reads the query and produces an initial ranking hypothesis (e.g., “the answer likely contains a red car”).
- Evidence Request – If the hypothesis is uncertain, the model issues a tool call such as “detect objects of type ‘car’ in image #3” or “run OCR on region (120,200,300,350)”.
- Tool Execution – An external visual module processes the request and returns concrete evidence (object labels, bounding boxes, text snippets).
- Verification & Refinement – The MLLM incorporates the evidence, revises its confidence scores, and may issue further requests until a stopping criterion is met.
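The four steps above can be sketched as a loop over stub tools. The confidence updates, the fake detector, and the stopping constants are illustrative stand-ins for the paper's MLLM-driven components.

```python
# Minimal sketch of the hypothesis -> evidence -> verification loop.
# The stub detector and update rules are assumptions, not the paper's method.

CONF_THRESHOLD = 0.9   # stopping criterion: one candidate is convincing
MAX_STEPS = 3          # hard cap on evidence-gathering rounds

def detect_objects(image_id):
    """Stub visual tool; a real system would invoke an object detector."""
    fake_index = {3: ["car", "tree"], 7: ["person"]}
    return fake_index.get(image_id, [])

def agentic_rerank(query_terms, candidate_ids):
    # Hypothesis generation: start every candidate at a neutral confidence.
    confidences = {cid: 0.5 for cid in candidate_ids}
    for _ in range(MAX_STEPS):
        if max(confidences.values()) >= CONF_THRESHOLD:
            break  # verification succeeded; stop requesting evidence
        for cid in candidate_ids:
            # Evidence request + tool execution.
            labels = detect_objects(cid)
            # Verification & refinement: reward matching evidence,
            # penalize contradictory evidence.
            if any(t in labels for t in query_terms):
                confidences[cid] = min(1.0, confidences[cid] + 0.5)
            else:
                confidences[cid] = max(0.0, confidences[cid] - 0.2)
    return sorted(candidate_ids, key=lambda c: confidences[c], reverse=True)

print(agentic_rerank(["car"], [3, 7]))  # → [3, 7]
```

In the real system the MLLM decides which tool to call and how to interpret the returned evidence; here both are hard-coded to keep the control flow visible.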
Training Strategy
- Curriculum Learning – Starts with supervised examples where the correct evidence‑request sequence is provided, then gradually introduces harder cases requiring rejection‑based refinement.
- Rejection‑Based Refinement – The model learns to discard wrong hypotheses after seeing contradictory evidence, mimicking human “try‑and‑discard” reasoning.
- Reinforcement Learning (RL) – An evidence‑aligned reward encourages the model to request just enough evidence to reach the correct answer, penalizing unnecessary tool calls.
- Evidence‑Aligned Objective – The loss combines standard retrieval ranking loss with a term that measures how well the gathered evidence matches the ground‑truth visual cues.
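A reward with the shape described above can be sketched as follows. The weights and the Jaccard overlap used to measure evidence alignment are my assumptions; the paper's exact objective is not reproduced here.

```python
# Hedged sketch of an evidence-aligned RL reward: task correctness, plus an
# alignment bonus for evidence matching ground-truth cues, minus a per-call
# cost that discourages unnecessary tool use. Weights are illustrative.

def evidence_reward(correct, gathered, gold_cues, n_tool_calls,
                    w_align=0.5, w_cost=0.1):
    task = 1.0 if correct else 0.0
    # Alignment term: Jaccard overlap between gathered evidence and
    # ground-truth visual cues (an assumption for illustration).
    union = gathered | gold_cues
    overlap = len(gathered & gold_cues) / len(union) if union else 0.0
    return task + w_align * overlap - w_cost * n_tool_calls

# Correct answer, half the gold cues gathered, two tool calls:
print(evidence_reward(True, {"car", "red"}, {"car", "logo"}, 2))
```

The cost term is what produces the "just enough evidence" behavior: an extra tool call must buy enough alignment or correctness to pay for itself.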
Implementation Details
- Base MLLM: LLaMA‑2‑7B fine‑tuned with multimodal adapters.
- Visual Tools: Pre‑trained DETR for object detection, Tesseract OCR, CLIP‑based region embeddings, and a lightweight video frame sampler.
- Inference Overhead: roughly 1.8× slower than a static encoder; the cost stays bounded because evidence requests are issued only for the top‑k candidates.
Results & Findings
| Benchmark | Baseline (static encoder) | V‑Retrver | Relative Δ (↑) |
|---|---|---|---|
| MSCOCO Image‑Text Retrieval | 38.2 % R@1 | 46.9 % R@1 | +23 % |
| Flickr30K | 41.5 % R@1 | 50.8 % R@1 | +22 % |
| TV‑QA Video‑Text Retrieval | 29.3 % R@1 | 36.7 % R@1 | +25 % |
| WebVision (noisy web images) | 31.0 % R@1 | 38.5 % R@1 | +24 % |
- Reliability: In cases with visually ambiguous queries (e.g., “a person holding a small object”), V‑Retrver’s evidence‑driven verification reduced hallucinations by ~40 % compared to pure‑language CoT methods.
- Generalization: Without any dataset‑specific fine‑tuning, the same agent achieved comparable gains on both image‑ and video‑based retrieval tasks, indicating the approach is not tied to a particular modality.
- Efficiency Trade‑off: The average number of tool calls per query was 2.3, striking a balance between performance boost and computational cost.
Practical Implications
- Better Search Engines – Integrating V‑Retrver‑style agents into image or video search platforms can improve relevance, especially for queries that hinge on fine‑grained visual details (e.g., “red sports car with a visible license plate”).
- Content Moderation – The ability to request targeted evidence (e.g., “detect nudity in region X”) can make automated moderation more precise and explainable.
- E‑Commerce – Product search can benefit from on‑demand verification (“show items with a visible brand logo”) without pre‑computing exhaustive attribute embeddings for every catalog item.
- Developer Toolkits – The framework is modular—any off‑the‑shelf visual model can be wrapped as a tool, allowing developers to plug in domain‑specific detectors (medical imaging, satellite imagery) and let the LLM orchestrate them.
- Explainability – Because the reasoning trace includes explicit evidence requests and tool outputs, developers can surface a “why this result?” view to end‑users, increasing trust.
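The "any off‑the‑shelf visual model can be wrapped as a tool" point above can be sketched as a small registry that maps tool names to callables, which the agent dispatches against. The tool names, signatures, and stub OCR function are illustrative, not part of the paper.

```python
# Sketch of a modular tool registry: any function becomes a tool the
# agent can invoke by name. All names and signatures here are assumptions.
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., object]] = {}

def register_tool(name: str):
    """Decorator that exposes any function as a named tool."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("ocr")
def run_ocr(image_id: int, region: tuple) -> str:
    # Stand-in for a real OCR engine such as Tesseract.
    return f"text-from-{image_id}-{region}"

def dispatch(call: dict):
    """Execute a tool call emitted by the agent (e.g. a parsed JSON action)."""
    return TOOLS[call["tool"]](*call["args"])

print(dispatch({"tool": "ocr", "args": (3, (120, 200, 300, 350))}))
```

Because dispatch is name-based, swapping in a domain-specific detector (medical imaging, satellite imagery) is a one-decorator change, and the recorded `call` dicts double as the explainable reasoning trace.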
Limitations & Future Work
- Latency – The interactive evidence‑gathering loop adds inference time, which may be prohibitive for real‑time applications without further optimization (e.g., caching frequent tool results).
- Tool Dependency – The quality of retrieved evidence is bounded by the underlying visual modules; poor detectors can mislead the reasoning process.
- Scalability to Large Corpora – Current experiments evaluate top‑k candidate reranking; extending the approach to full‑scale retrieval (millions of items) will require efficient candidate pruning strategies.
- Learning from Noisy Evidence – Future work could explore robust RL objectives that tolerate imperfect tool outputs, and investigate self‑supervised curricula that automatically generate evidence‑request sequences.
V‑Retrver opens a promising direction where language models become active agents that “look” at the world when needed, turning static retrieval pipelines into dynamic, evidence‑grounded systems.
Authors
- Dongyang Chen
- Chaoyang Wang
- Dezhao Su
- Xi Xiao
- Zeyu Zhang
- Jing Xiong
- Qing Li
- Yuzhang Shang
- Shichao Ka
Paper Information
- arXiv ID: 2602.06034v1
- Categories: cs.CV
- Published: February 5, 2026