[Paper] V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Published: February 5, 2026 at 01:59 PM EST
5 min read
Source: arXiv - 2602.06034v1

Overview

The paper introduces V‑Retrver, a new framework that turns multimodal retrieval (searching for images, videos, or other media based on a textual query) into an agentic reasoning process. Instead of relying solely on pre‑computed visual embeddings, V‑Retrver lets a multimodal large language model (MLLM) actively request visual evidence from external tools, verify its hypotheses, and iteratively improve its ranking decisions. The result is a system that reasons more reliably on ambiguous visual content and boosts retrieval performance across several benchmarks.

Key Contributions

  • Evidence‑driven retrieval paradigm – reframes retrieval as a loop of hypothesis generation → targeted visual inspection → hypothesis refinement.
  • Agentic MLLM – equips the language model with the ability to invoke external visual tools (e.g., object detectors, OCR, region proposal networks) on‑the‑fly during reasoning.
  • Curriculum‑based training pipeline – combines supervised “reasoning activation” data, a rejection‑based refinement stage, and reinforcement learning with an evidence‑alignment loss to teach the model when and how to request visual evidence.
  • Strong empirical gains – achieves an average relative improvement of 23 % in retrieval accuracy over strong baselines on multiple multimodal retrieval datasets.
  • Demonstrated generalization – the same trained agent works across diverse domains (image‑text, video‑text, and cross‑modal retrieval) without task‑specific fine‑tuning.

Methodology

Problem Formulation

  • Traditional multimodal retrieval pipelines encode each candidate image/video into a static vector and rank them with a similarity score.
  • V‑Retrver treats each candidate as a potential evidence source and lets the MLLM decide whether more visual information is needed.
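To make the contrast concrete, the traditional pipeline the paper moves away from can be sketched in a few lines: every candidate is reduced to a fixed vector, and ranking is a single cosine-similarity pass with no chance to gather further evidence. The embeddings below are toy values for illustration.

```python
import numpy as np

def static_rank(query_emb, candidate_embs):
    """Rank candidates by cosine similarity to the query (the traditional pipeline)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores), scores

# Toy 4-dim embeddings standing in for real visual/text encoders.
query = np.array([1.0, 0.0, 0.0, 0.0])
candidates = np.array([
    [0.9, 0.1, 0.0, 0.0],   # close to the query
    [0.0, 1.0, 0.0, 0.0],   # orthogonal to the query
    [0.5, 0.5, 0.0, 0.0],   # partially aligned
])
order, scores = static_rank(query, candidates)
print(order)  # candidate 0 ranks first, then 2, then 1
```

Whatever the similarity scores miss (an ambiguous object, unreadable text) stays missed; V‑Retrver's point is to let the model go back and look.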

Agentic Reasoning Loop

  1. Hypothesis Generation – The MLLM reads the query and produces an initial ranking hypothesis (e.g., “the answer likely contains a red car”).
  2. Evidence Request – If the hypothesis is uncertain, the model issues a tool call such as “detect objects of type ‘car’ in image #3” or “run OCR on region (120,200,300,350)”.
  3. Tool Execution – An external visual module processes the request and returns concrete evidence (object labels, bounding boxes, text snippets).
  4. Verification & Refinement – The MLLM incorporates the evidence, revises its confidence scores, and may issue further requests until a stopping criterion is met.
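The four steps above can be sketched as a loop over candidates. Everything here is an illustrative stand-in (the fake detector, the confidence updates, the stopping rule), not the paper's actual interfaces, but it shows the hypothesis → request → execute → refine cycle.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    scores: dict                       # candidate id -> confidence
    checked: set = field(default_factory=set)

def detect_objects(image_id, label, images):
    """Fake object detector: 'detects' the label if it is in the image's tag set."""
    return label in images[image_id]

def agentic_rank(query_label, images, max_steps=10):
    # Step 1: hypothesis generation - start from a uniform prior over candidates.
    state = State(scores={i: 0.5 for i in images})
    for _ in range(max_steps):
        # Step 2: evidence request - pick the next unchecked (uncertain) candidate.
        pending = [i for i in images if i not in state.checked]
        if not pending:                # stopping criterion: all evidence gathered
            break
        target = pending[0]
        # Step 3: tool execution.
        found = detect_objects(target, query_label, images)
        # Step 4: verification & refinement - revise confidence from the evidence.
        state.scores[target] = 0.95 if found else 0.05
        state.checked.add(target)
    return sorted(state.scores, key=state.scores.get, reverse=True)

# Candidate images represented by their (hidden) object tags.
images = {0: {"dog"}, 1: {"red", "car"}, 2: {"car"}}
print(agentic_rank("car", images))  # images containing a car rank first
```

In the real system the "detector" is an external visual model and the refinement step is the MLLM integrating structured evidence, but the control flow is the same.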

Training Strategy

  • Curriculum Learning – Starts with supervised examples where the correct evidence‑request sequence is provided, then gradually introduces harder cases requiring rejection‑based refinement.
  • Rejection‑Based Refinement – The model learns to discard wrong hypotheses after seeing contradictory evidence, mimicking human “try‑and‑discard” reasoning.
  • Reinforcement Learning (RL) – An evidence‑aligned reward encourages the model to request just enough evidence to reach the correct answer, penalizing unnecessary tool calls.
  • Evidence‑Aligned Objective – The loss combines standard retrieval ranking loss with a term that measures how well the gathered evidence matches the ground‑truth visual cues.
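A minimal sketch of what an evidence-aligned RL reward of this shape could look like; the weights and the exact alignment term below are assumptions for illustration, not the paper's formula.

```python
def evidence_reward(correct, evidence_match, num_tool_calls,
                    alpha=1.0, beta=0.5, cost=0.1):
    """Reward correct rankings supported by well-aligned evidence, while
    charging a small per-call cost to discourage unnecessary tool requests.

    correct        : whether the final ranking was right
    evidence_match : [0, 1] score of how well gathered evidence matches GT cues
    num_tool_calls : number of evidence requests issued
    """
    r = alpha * (1.0 if correct else 0.0)   # retrieval correctness term
    r += beta * evidence_match              # evidence-alignment term
    r -= cost * num_tool_calls              # penalty for each tool call
    return r

# A correct answer backed by well-matched evidence and two tool calls:
print(evidence_reward(True, 0.8, 2))   # 1.0 + 0.4 - 0.2 = 1.2
# A wrong answer with many calls is penalized despite decent evidence:
print(evidence_reward(False, 0.8, 5))  # 0.0 + 0.4 - 0.5 = -0.1
```

The per-call cost is what pushes the policy toward "just enough" evidence, matching the paper's stated goal.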

Implementation Details

  • Base MLLM: LLaMA‑2‑7B fine‑tuned with multimodal adapters.
  • Visual Tools: Pre‑trained DETR for object detection, Tesseract OCR, CLIP‑based region embeddings, and a lightweight video frame sampler.
  • Inference Overhead: ≈ 1.8× slower than a static encoder; the overhead stays bounded because evidence requests are made only for the top‑k candidates.
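Since the paper describes tools as interchangeable modules (DETR, Tesseract, CLIP regions), a tool layer could plausibly look like a simple name-to-callable registry that the agent dispatches into. This is a hypothetical sketch; the function names and return schemas are illustrative, not the paper's API.

```python
from typing import Callable, Dict

TOOLS: Dict[str, Callable] = {}

def register_tool(name: str):
    """Decorator that registers a visual model wrapper under a tool name."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("ocr")
def run_ocr(image: str, region: tuple) -> str:
    """Stand-in for a Tesseract call on a crop of the image."""
    x0, y0, x1, y1 = region
    return f"<text from {image} in ({x0},{y0},{x1},{y1})>"

@register_tool("detect")
def detect(image: str, label: str) -> list:
    """Stand-in for a DETR detection query filtered to one object class."""
    return [{"label": label, "box": (0, 0, 10, 10), "score": 0.9}]

# The agent's tool call from the reasoning loop becomes a dictionary dispatch:
print(TOOLS["ocr"]("image_3.jpg", (120, 200, 300, 350)))
```

Swapping in a domain-specific detector (medical, satellite) is then just another `@register_tool` entry.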

Results & Findings

| Benchmark | Baseline (static encoder) | V‑Retrver | Δ (↑) |
|---|---|---|---|
| MSCOCO Image‑Text Retrieval | 38.2 % R@1 | 46.9 % | +23 % |
| Flickr30K | 41.5 % R@1 | 50.8 % | +22 % |
| TV‑QA Video‑Text Retrieval | 29.3 % R@1 | 36.7 % | +25 % |
| WebVision (noisy web images) | 31.0 % R@1 | 38.5 % | +24 % |

  • Reliability: In cases with visually ambiguous queries (e.g., “a person holding a small object”), V‑Retrver’s evidence‑driven verification reduced hallucinations by ~40 % compared to pure‑language CoT methods.
  • Generalization: Without any dataset‑specific fine‑tuning, the same agent achieved comparable gains on both image‑ and video‑based retrieval tasks, indicating the approach is not tied to a particular modality.
  • Efficiency Trade‑off: The average number of tool calls per query was 2.3, striking a balance between performance boost and computational cost.

Practical Implications

  • Better Search Engines – Integrating V‑Retrver‑style agents into image or video search platforms can improve relevance, especially for queries that hinge on fine‑grained visual details (e.g., “red sports car with a visible license plate”).
  • Content Moderation – The ability to request targeted evidence (e.g., “detect nudity in region X”) can make automated moderation more precise and explainable.
  • E‑Commerce – Product search can benefit from on‑demand verification (“show items with a visible brand logo”) without pre‑computing exhaustive attribute embeddings for every catalog item.
  • Developer Toolkits – The framework is modular—any off‑the‑shelf visual model can be wrapped as a tool, allowing developers to plug in domain‑specific detectors (medical imaging, satellite imagery) and let the LLM orchestrate them.
  • Explainability – Because the reasoning trace includes explicit evidence requests and tool outputs, developers can surface a “why this result?” view to end‑users, increasing trust.
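Because the trace is just a sequence of requests and tool outputs, rendering it into a user-facing explanation is straightforward. The trace schema below is an assumption for illustration; the paper does not specify one.

```python
def explain(trace: list) -> str:
    """Render an agent's evidence trace as a 'why this result?' summary."""
    lines = [
        f"- asked {step['tool']} about {step['target']}: {step['finding']}"
        for step in trace
    ]
    return "Why this result?\n" + "\n".join(lines)

# Hypothetical trace from a query like "red car with a visible license plate":
trace = [
    {"tool": "detect", "target": "image #3",
     "finding": "found 1 'car' (score 0.92)"},
    {"tool": "ocr", "target": "region (120,200,300,350)",
     "finding": "read 'ABC-123'"},
]
print(explain(trace))
```

Surfacing this view alongside results is what turns the agent's extra tool calls into a trust-building feature rather than pure overhead.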

Limitations & Future Work

  • Latency – The interactive evidence‑gathering loop adds inference time, which may be prohibitive for real‑time applications without further optimization (e.g., caching frequent tool results).
  • Tool Dependency – The quality of retrieved evidence is bounded by the underlying visual modules; poor detectors can mislead the reasoning process.
  • Scalability to Large Corpora – Current experiments evaluate top‑k candidate reranking; extending the approach to full‑scale retrieval (millions of items) will require efficient candidate pruning strategies.
  • Learning from Noisy Evidence – Future work could explore robust RL objectives that tolerate imperfect tool outputs, and investigate self‑supervised curricula that automatically generate evidence‑request sequences.

V‑Retrver opens a promising direction where language models become active agents that “look” at the world when needed, turning static retrieval pipelines into dynamic, evidence‑grounded systems.

Authors

  • Dongyang Chen
  • Chaoyang Wang
  • Dezhao SU
  • Xi Xiao
  • Zeyu Zhang
  • Jing Xiong
  • Qing Li
  • Yuzhang Shang
  • Shichao Ka

Paper Information

  • arXiv ID: 2602.06034v1
  • Categories: cs.CV
  • Published: February 5, 2026