[Paper] RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering

Published: February 20, 2026 at 01:48 PM EST
4 min read
Source: arXiv

Overview

The paper “RVR: Retrieve‑Verify‑Retrieve for Comprehensive Question Answering” proposes a simple yet powerful multi‑round retrieval pipeline that dramatically improves the chance of surfacing all correct answers to a question. By iteratively refining the query with verified documents, the authors show that even off‑the‑shelf retrievers can achieve substantially higher recall on challenging multi‑answer datasets.

Key Contributions

  • RVR framework – a three‑step loop (retrieve → verify → retrieve) that repeatedly expands the query with high‑quality evidence.
  • Verifier module – a lightweight classifier that filters the first‑round results to a trusted subset, guiding the next retrieval round.
  • Retriever adaptation – fine‑tuning existing dense/sparse retrievers to work with the RVR inference pattern yields extra gains.
  • Strong empirical gains – ≥10 % relative (≈3 % absolute) improvement in complete recall on QAMPARI, plus consistent lifts on out‑of‑domain benchmarks (QUEST, WebQuestionsSP).
  • Compatibility – works with any standard retriever (BM25, DPR, ColBERT, etc.) and can be dropped into existing QA pipelines with minimal engineering effort.

Methodology

  1. First Retrieval – The original user query is fed to a conventional retriever, producing a candidate set of documents.
  2. Verification – A verifier (trained on a small amount of labeled data) scores each candidate and selects a high‑precision subset that is likely to contain correct answers.
  3. Query Augmentation – The verified documents are concatenated (or encoded) and appended to the original query, forming an expanded query that carries the context of already‑found evidence.
  4. Second Retrieval (and beyond) – The expanded query is run through the same retriever to fetch new documents that were missed the first time. Steps 2‑4 can repeat for multiple rounds until a stopping criterion (e.g., no new high‑scoring docs) is met.
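The four steps above can be sketched as a single inference-time loop. This is a minimal illustration, not the authors' code: `retrieve` and `verify` are hypothetical stand-ins for any retriever that maps a query string to ranked documents and any verifier that filters candidates to a trusted subset.

```python
def rvr(query, retrieve, verify, max_rounds=2, k=10):
    """Run up to max_rounds of the retrieve-verify-retrieve loop."""
    verified = []            # trusted evidence accumulated across rounds
    seen = set()             # avoid re-adding documents we already kept
    current_query = query
    for _ in range(max_rounds):
        candidates = retrieve(current_query, k=k)   # steps 1 and 4: retrieval
        new_docs = [d for d in verify(query, candidates) if d not in seen]
        if not new_docs:                            # stopping criterion:
            break                                   # no new verified evidence
        verified.extend(new_docs)
        seen.update(new_docs)
        # step 3: expand the original query with the verified evidence
        current_query = query + " " + " ".join(verified)
    return verified
```

Because the loop only re-runs the same retriever on an expanded query string, it drops into an existing pipeline without index changes; capping `max_rounds` bounds the added latency.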

The verifier is deliberately lightweight (often a cross‑encoder or a simple similarity model) so that the extra latency per round stays modest. The whole loop can be executed at inference time without re‑training the retriever, though the authors also experiment with fine‑tuning the retriever to better handle the augmented queries.
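To make the verifier's interface concrete, here is a toy stand-in: the paper trains a lightweight model (e.g. a cross-encoder), whereas this sketch uses a simple token-overlap score with a threshold purely to illustrate the filter-to-a-high-precision-subset contract, not the actual model.

```python
def verify(query, candidates, threshold=0.5):
    """Keep only candidates whose overlap with the query clears a threshold."""
    q_tokens = set(query.lower().split())
    kept = []
    for doc in candidates:
        d_tokens = set(doc.lower().split())
        # fraction of query tokens that appear in the document
        score = len(q_tokens & d_tokens) / max(len(q_tokens), 1)
        if score >= threshold:
            kept.append(doc)
    return kept
```

Any scorer with this signature (candidates in, trusted subset out) slots into the loop; a trained cross-encoder simply replaces the overlap heuristic with a learned relevance score.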

Results & Findings

| Dataset | Baseline Retriever (single-round) | RVR (2-round) | Relative Gain |
|---|---|---|---|
| QAMPARI (multi-answer) | 58 % complete recall | 63 % | +10 % |
| QUEST (out-of-domain) | 71 % | 74 % | +4 % |
| WebQuestionsSP | 68 % | 71 % | +4 % |
  • Gains are consistent across different retriever families (BM25, DPR, ColBERT).
  • Fine‑tuning the retriever for the RVR loop adds an extra ~1‑2 % absolute improvement.
  • The verifier’s precision is high (≈90 % on the filtered set), ensuring that the query augmentation does not introduce noise.
  • Ablation studies confirm that both the verification step and the query augmentation are essential; removing either drops performance back to baseline levels.
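"Complete recall", the headline metric on multi-answer benchmarks like QAMPARI, is the fraction of questions for which the retrieved set covers *every* gold answer. A sketch of the computation, under the simplifying assumption that an answer counts as covered when it appears as a substring of some retrieved document:

```python
def complete_recall(gold_answer_sets, retrieved_doc_sets):
    """Fraction of questions whose retrieved docs cover all gold answers."""
    hits = 0
    for answers, docs in zip(gold_answer_sets, retrieved_doc_sets):
        text = " ".join(docs).lower()
        if all(ans.lower() in text for ans in answers):
            hits += 1                     # every answer found for this question
    return hits / len(gold_answer_sets)
```

This all-or-nothing criterion explains why multi-round retrieval helps so much: a single missed answer zeroes out the question, so recovering documents missed in the first round directly converts failures into hits.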

Practical Implications

  • Better coverage for open‑domain QA assistants – Voice assistants, chatbots, and search‑augmented LLMs can retrieve more complete answer sets, reducing “I don’t know” failures.
  • Reduced need for massive index expansions – By re‑using the same index with smarter query formulation, developers can achieve higher recall without scaling storage.
  • Plug‑and‑play component – The verifier can be trained on a small, domain‑specific QA dataset and then reused across multiple products, making it attractive for enterprises with limited annotation budgets.
  • Improved downstream reasoning – When a downstream answer‑generation model receives a richer set of evidence, its factual accuracy and answer diversity improve, which is critical for applications like medical QA or legal research.
  • Cost‑effective scaling – Since the extra retrieval round is just another pass over the existing index, the incremental compute cost is modest compared to training a new, larger retriever from scratch.

Limitations & Future Work

  • Latency overhead – Each additional retrieval round adds latency; real‑time systems may need to cap the number of rounds or use approximate verification.
  • Verifier dependence – The approach assumes a verifier that can reliably separate high‑quality docs; in domains with scarce labeled data, verifier performance may degrade.
  • Query drift risk – Poorly filtered documents could steer the expanded query away from the original intent, especially in highly ambiguous queries.
  • Future directions suggested by the authors include adaptive stopping criteria, tighter integration with generative LLMs (e.g., using the verifier’s confidence as a prompt), and exploring multi‑modal evidence (images, tables) within the RVR loop.

Authors

  • Deniz Qian
  • Hung‑Ting Chen
  • Eunsol Choi

Paper Information

  • arXiv ID: 2602.18425v1
  • Categories: cs.CL, cs.IR
  • Published: February 20, 2026
