[Paper] Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

Published: May 5, 2026 at 01:42 PM EDT

Source: arXiv - 2605.04018v1

Overview

The paper tackles a growing pain point in “agentic” search systems: software agents that iteratively retrieve information, reason over it, and synthesize answers. Traditional retrieval models are built to find a single topically relevant document, but agents need evidence portfolios, meaning multiple complementary passages that jointly support a chain of reasoning. The authors introduce a richer benchmark (BRIGHT‑Pro) and a new training corpus (RTriever‑Synth) that enable more realistic evaluation and stronger retrievers for these reasoning‑intensive tasks.

Key Contributions

BRIGHT‑Pro benchmark – an expert‑annotated extension of the existing BRIGHT dataset that supplies multi‑aspect gold evidence for each query and defines two evaluation protocols: (1) static retrieval and (2) agentic, iterative retrieval.
Aspect‑decomposed synthetic corpus (RTriever‑Synth) – automatically generated passages that (a) cover distinct aspects of a query (complementary positives) and (b) provide positive‑conditioned hard negatives to teach models to avoid redundant hits.
LoRA fine‑tuning of a 4‑billion‑parameter embedding model (RTriever‑4B) – built on Qwen3‑Embedding‑4B, showing that lightweight adaptation can yield large gains for reasoning‑intensive retrieval.
Comprehensive empirical study – compares lexical, general‑purpose, and reasoning‑oriented retrievers under both standard and agentic metrics, revealing hidden failure modes of conventional evaluation.

Methodology

Benchmark Construction
- Human experts expanded each query in the original BRIGHT set with multiple gold passages, each covering a distinct reasoning aspect (e.g., background, counter‑example, quantitative evidence).
- Two evaluation settings were defined:
  - Static: a single retrieval round, mirroring classic IR tests.
  - Agentic: a simulated loop where the agent can request additional passages after each reasoning step, mimicking real‑world tool use.
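
The difference between the two settings can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's protocol: `retrieve` is a toy word-overlap scorer, `reformulate` is a hypothetical stand-in for the agent's reasoning step, and `max_steps` is an illustrative retrieval budget.

```python
from typing import Callable, List, Set

Corpus = List[str]


def retrieve(query: str, corpus: Corpus, k: int, exclude: Set[int]) -> List[int]:
    """Toy retriever (assumption): rank unseen passages by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = [
        (len(q_words & set(p.lower().split())), -i, i)
        for i, p in enumerate(corpus)
        if i not in exclude
    ]
    scored.sort(reverse=True)  # highest overlap first, lower index wins ties
    return [i for score, _, i in scored[:k] if score > 0]


def static_retrieval(query: str, corpus: Corpus, k: int) -> List[int]:
    """Static setting: a single retrieval round with no follow-up."""
    return retrieve(query, corpus, k, exclude=set())


def agentic_retrieval(
    query: str,
    corpus: Corpus,
    k: int,
    max_steps: int,
    reformulate: Callable[[str, List[str]], str],
) -> List[int]:
    """Agentic setting: retrieve, reason (here: reformulate the query), retrieve again."""
    seen: List[int] = []
    current = query
    for _ in range(max_steps):
        hits = retrieve(current, corpus, k, exclude=set(seen))
        if not hits:
            break  # nothing new to read, so the loop stops before the budget is spent
        seen.extend(hits)
        current = reformulate(query, [corpus[i] for i in seen])
    return seen


# Invented three-passage corpus for the usage example.
corpus = [
    "solar panels convert sunlight into electricity",
    "electricity storage needs batteries",
    "batteries degrade with heat",
]


def append_evidence(query: str, evidence: List[str]) -> str:
    # Hypothetical reasoning step: fold the latest passage into the next query.
    return query + " " + evidence[-1]


static_hits = static_retrieval("solar panels", corpus, k=1)
agentic_hits = agentic_retrieval(
    "solar panels", corpus, k=1, max_steps=3, reformulate=append_evidence
)
```

In this toy run the static setting only reaches the first passage, while the loop follows the chain to the other two because each retrieved passage changes the next query.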
Synthetic Training Corpus (RTriever‑Synth)
- Starting from a large text collection, the authors used large‑language models (LLMs) to decompose each query into explicit aspects.
- For each aspect, the LLM generated a positive passage and a hard negative conditioned on the positive (i.e., similar wording but missing the crucial piece of evidence).
- This yields a balanced set of complementary positives and challenging negatives that teach the retriever to diversify its results.
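
One way to picture a single RTriever‑Synth record is as a query with one positive and one hard negative per aspect. The layout below is a hypothetical sketch: the class names, field names, and example text are invented for illustration and are not taken from the released corpus.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AspectExample:
    aspect: str         # short label for the reasoning aspect
    positive: str       # passage that supplies the evidence for this aspect
    hard_negative: str  # similar wording, but the crucial evidence is missing


@dataclass
class SynthQuery:
    query: str
    aspects: List[AspectExample]

    def training_triples(self) -> List[Tuple[str, str, str]]:
        """Flatten into (query, positive, hard_negative) triples, one per aspect."""
        return [(self.query, a.positive, a.hard_negative) for a in self.aspects]


# Invented example record; the text is illustrative only.
record = SynthQuery(
    query="Why do lithium batteries lose capacity in hot climates?",
    aspects=[
        AspectExample(
            aspect="background",
            positive="Lithium-ion cells store charge by moving ions between electrodes.",
            hard_negative="Lithium-ion cells are widely used in consumer electronics.",
        ),
        AspectExample(
            aspect="quantitative evidence",
            positive="Cells stored at high temperature lost capacity faster in testing.",
            hard_negative="Cells were tested at several temperatures in the lab.",
        ),
    ],
)

triples = record.training_triples()
```

Keeping the negative attached to its positive, rather than sampling negatives at random, is what makes it "positive‑conditioned" in the sense described above.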
Model Fine‑tuning
- The base embedding model (Qwen3‑Embedding‑4B) was adapted using Low‑Rank Adaptation (LoRA), a parameter‑efficient technique that adds a small set of trainable matrices.
- Training minimized a contrastive loss that pulls each query toward its aspect‑specific positives while pushing the hard negatives away.
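
The summary does not say which contrastive loss was used. As a minimal sketch, assuming an InfoNCE-style objective over cosine similarities, the loss for one query could look like the following; the function names, the temperature value, and the toy two-dimensional vectors are all assumptions.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not from the paper
) -> float:
    """InfoNCE loss: negative log-probability of the positive among all candidates."""
    pos_logit = cosine(query, positive) / temperature
    logits = [pos_logit] + [cosine(query, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos_logit


query = [1.0, 0.0]
positive = [1.0, 0.0]

# An unrelated negative is easy to separate from the positive.
easy_loss = info_nce(query, positive, negatives=[[0.0, 1.0]])
# A near-duplicate negative is hard, so it produces a much larger loss.
hard_loss = info_nce(query, positive, negatives=[[1.0, 0.1]])
```

The gap between `easy_loss` and `hard_loss` shows why hard negatives carry most of the training signal: a negative that is already far from the query contributes almost nothing to the loss.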
Evaluation Pipeline
- Metrics include standard recall@k, Aspect‑Recall (how many distinct aspects are covered), and Agentic Success Rate (whether the simulated agent can complete a reasoning task within a budget of retrieval steps).
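
The gap between recall@k and Aspect‑Recall can be made concrete with a short sketch. The definitions below are an assumed reading of the description above (recall counts gold passages, Aspect‑Recall counts aspects with at least one gold passage in the top k); the paper's exact formulas may differ, and the document IDs are invented.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Fraction of gold passages that appear in the top-k results."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def aspect_recall_at_k(
    ranked: List[str], gold_by_aspect: Dict[str, Set[str]], k: int
) -> float:
    """Fraction of aspects with at least one gold passage in the top-k results."""
    if not gold_by_aspect:
        return 0.0
    top_k = set(ranked[:k])
    covered = sum(1 for docs in gold_by_aspect.values() if top_k & docs)
    return covered / len(gold_by_aspect)


# Invented gold annotations: three aspects, four gold passages in total.
gold_by_aspect = {
    "background": {"d1", "d2"},
    "counter-example": {"d3"},
    "quantitative evidence": {"d4"},
}
all_gold = set().union(*gold_by_aspect.values())

# A ranking whose top hits are redundant: both cover the same aspect.
ranked = ["d1", "d2", "d9", "d3", "d4"]

recall = recall_at_k(ranked, all_gold, k=3)
aspect_recall = aspect_recall_at_k(ranked, gold_by_aspect, k=3)
```

Here the top three results contain half of the gold passages but cover only one of the three aspects, which is the kind of failure that plain recall hides.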

Results & Findings

Retriever	Static Recall@10	Aspect‑Recall@10	Agentic Success (≤5 steps)
BM25	38.2 %	21.5 %	12.3 %
DPR (general)	45.7 %	28.9 %	18.7 %
RTriever‑4B (proposed)	61.4 %	49.2 %	34.5 %

Aspect‑aware evaluation surfaces gaps: many strong lexical models achieve decent overall recall but miss critical aspects, leading to low Aspect‑Recall.
Agentic protocol amplifies differences: models that retrieve redundant passages stall the simulated agent, dramatically lowering success rates.
RTriever‑4B closes the gap: thanks to the aspect‑decomposed training data, it learns to surface a diverse set of evidence, boosting both static and agentic metrics.

Qualitative analysis shows RTriever‑4B often returns a background article, a data table, and a counter‑argument in the first three hits, which is the mix of evidence an autonomous reasoning agent needs.

Practical Implications

Better tool use for AI assistants – Developers building ChatGPT‑style agents, code assistants, or research assistants can plug RTriever‑4B (or the training pipeline) into their retrieval layer to give the downstream LLM a richer evidence set, which may reduce hallucinations.
Reduced retrieval budget – Because the model surfaces complementary evidence early, agents need fewer retrieval cycles, saving API calls and latency in production systems.
Fine‑tuning recipe for niche domains – The LoRA‑based approach means teams can adapt a large embedding model to domain‑specific aspect structures (e.g., legal reasoning, medical diagnosis) with modest compute.
Benchmark for product teams – BRIGHT‑Pro offers a ready‑to‑use test suite that mirrors real‑world iterative search, enabling more honest QA of retrieval components before release.

Limitations & Future Work

Annotation cost – Expanding gold evidence to multiple aspects required expert labor; scaling BRIGHT‑Pro to thousands of queries may be prohibitive.
Synthetic bias – RTriever‑Synth relies on LLM‑generated passages, which inherit the model’s biases and may not capture all real‑world nuance.
Agentic simulation simplifications – The paper’s agentic protocol assumes a fixed budget and deterministic reasoning steps; actual user‑driven agents may behave more unpredictably.
Future directions suggested include (1) crowdsourcing aspect annotations to enlarge the benchmark, (2) incorporating user‑feedback loops for on‑the‑fly aspect discovery, and (3) extending the training pipeline to multimodal evidence (tables, code snippets, images).

Authors

Yilun Zhao
Jinbiao Wei
Tingyu Song
Siyue Zhang
Chen Zhao
Arman Cohan

Paper Information

arXiv ID: 2605.04018v1
Categories: cs.CL, cs.IR
Published: May 5, 2026
