[Paper] Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

Published: May 5, 2026 at 01:42 PM EDT

Source: arXiv - 2605.04018v1

Overview

The paper tackles a growing pain point in “agentic” search systems, that is, software agents that iteratively retrieve information, reason over it, and synthesize answers. Traditional retrieval models focus on finding a single, topically relevant document, but agents need evidence portfolios: multiple, complementary passages that together support a chain of reasoning. The authors introduce a richer benchmark (BRIGHT‑Pro) and a new training corpus (RTriever‑Synth) that together enable more realistic evaluation and stronger retrievers for these reasoning‑intensive tasks.

Key Contributions

  • BRIGHT‑Pro benchmark – an expert‑annotated extension of the existing BRIGHT dataset that supplies multi‑aspect gold evidence for each query and defines two evaluation protocols: (1) static retrieval and (2) agentic, iterative retrieval.
  • Aspect‑decomposed synthetic corpus (RTriever‑Synth) – automatically generated passages that (a) cover distinct aspects of a query (complementary positives) and (b) provide positive‑conditioned hard negatives to teach models to avoid redundant hits.
  • LoRA fine‑tuning of a 4‑billion‑parameter embedding model (RTriever‑4B) – built on Qwen3‑Embedding‑4B, showing that lightweight adaptation can yield large gains for reasoning‑intensive retrieval.
  • Comprehensive empirical study – compares lexical, general‑purpose, and reasoning‑oriented retrievers under both standard and agentic metrics, revealing hidden failure modes of conventional evaluation.

Methodology

  1. Benchmark Construction

    • Human experts expanded each query in the original BRIGHT set with multiple gold passages, each covering a distinct reasoning aspect (e.g., background, counter‑example, quantitative evidence).
    • Two evaluation settings were defined:
      Static: a single retrieval round, mirroring classic IR tests.
      Agentic: a simulated loop where the agent can request additional passages after each reasoning step, mimicking real‑world tool‑use.
  2. Synthetic Training Corpus (RTriever‑Synth)

    • Starting from a large text collection, the authors used large language models (LLMs) to decompose each query into explicit aspects.
    • For each aspect, the LLM generated a positive passage and a hard negative conditioned on the positive (i.e., similar wording but missing the crucial piece of evidence).
    • This yields a balanced set of complementary positives and challenging negatives that teach the retriever to diversify its results.
  3. Model Fine‑tuning

    • The base embedding model (Qwen3‑Embedding‑4B) was adapted using Low‑Rank Adaptation (LoRA), a parameter‑efficient technique that adds a small set of trainable matrices.
    • Training minimized a contrastive loss that pulls each query toward its aspect‑specific positives while pushing the hard negatives away.
  4. Evaluation Pipeline

    • Metrics include standard recall@k, Aspect‑Recall (how many distinct aspects are covered), and Agentic Success Rate (whether the simulated agent can complete a reasoning task within a budget of retrieval steps).
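
The contrastive objective from step 3 can be illustrated with a small sketch. The summary does not state the exact loss, so this assumes an InfoNCE-style formulation over cosine similarities, a common choice for embedding models; the function name `info_nce_loss` and the temperature of 0.05 are illustrative assumptions, not values from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query, one positive, and its hard negatives.

    Assumed formulation (not confirmed by the paper):
    -log( exp(s_pos / t) / sum_i exp(s_i / t) ), with cosine similarity s.
    """
    q = normalize(query)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(q, c) / temperature for c in candidates]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy 2-d embeddings: a hard negative sits close to the positive,
# an easy negative points away from the query.
query = [1.0, 0.0]
positive = [0.9, 0.1]
hard_negative = [[0.8, 0.3]]
easy_negative = [[-1.0, 0.0]]

print(info_nce_loss(query, positive, hard_negative))
print(info_nce_loss(query, positive, easy_negative))
```

The hard negative yields the larger loss, which is the point of conditioning negatives on the positive: the gradient signal concentrates on passages that look similar but lack the needed evidence.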

Results & Findings

Retriever                 Static Recall@10   Aspect‑Recall@10   Agentic Success (≤5 steps)
BM25                      38.2 %             21.5 %             12.3 %
DPR (general)             45.7 %             28.9 %             18.7 %
RTriever‑4B (proposed)    61.4 %             49.2 %             34.5 %

  • Aspect‑aware evaluation surfaces gaps: many strong lexical models achieve decent overall recall but miss critical aspects, leading to low Aspect‑Recall.
  • Agentic protocol amplifies differences: models that retrieve redundant passages stall the simulated agent, dramatically lowering success rates.
  • RTriever‑4B closes the gap: thanks to the aspect‑decomposed training data, it learns to surface a diverse set of evidence, boosting both static and agentic metrics.
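
The gap between overall recall and aspect coverage can be made concrete with a minimal sketch. The definition of Aspect‑Recall used here (fraction of aspects with at least one gold passage in the top k) is an assumption based on the summary's description, "how many distinct aspects are covered"; the passage ids and aspect names are made up for illustration.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold passages that appear in the top-k results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(gold_ids)) / len(gold_ids)

def aspect_recall_at_k(ranked_ids, gold_aspects, k):
    """Fraction of aspects with at least one gold passage in the top-k.

    gold_aspects maps an aspect name to the passage ids that cover it
    (assumed definition, not taken verbatim from the paper).
    """
    top_k = set(ranked_ids[:k])
    covered = sum(1 for ids in gold_aspects.values() if top_k & set(ids))
    return covered / len(gold_aspects)

# Hypothetical query with three aspects and five gold passages.
gold_aspects = {
    "background": {"p1", "p2", "p3"},
    "counter_example": {"p4"},
    "quantitative": {"p5"},
}
gold_ids = set().union(*gold_aspects.values())

redundant = ["p1", "p2", "p3", "x1", "x2"]  # three hits, all on one aspect
diverse = ["p1", "p4", "p5", "x1", "x2"]    # three hits, one per aspect

print(recall_at_k(redundant, gold_ids, 5), round(aspect_recall_at_k(redundant, gold_aspects, 5), 3))
# 0.6 0.333
print(recall_at_k(diverse, gold_ids, 5), aspect_recall_at_k(diverse, gold_aspects, 5))
# 0.6 1.0
```

Both rankings score the same Recall@5, yet only the second covers every aspect, which is the failure mode that passage-level recall alone does not reveal.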

Qualitative analysis shows RTriever‑4B often returns a background article, a data table, and a counter‑argument in the first three hits, which is exactly the mix an autonomous reasoning agent needs.
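
To see why redundant results stall an agent under a fixed budget, here is a toy simulation. It is a deliberately simplified stand-in for the paper's agentic protocol: there is no LLM, the "agent" succeeds once every required aspect has been retrieved, and each step returns one new passage from a fixed ranking. The names `make_retriever` and `run_agent` and all passage ids are hypothetical; the budget of 5 mirrors the "≤5 steps" column in the results table.

```python
def make_retriever(ranking, per_step=1):
    """Toy retriever: each call returns the next unseen passages from a fixed ranking."""
    def retrieve(seen):
        unseen = [pid for pid in ranking if pid not in seen]
        return unseen[:per_step]
    return retrieve

def run_agent(retrieve, required_aspects, passage_aspect, max_steps=5):
    """Return the step at which all aspects are covered, or None if the budget runs out."""
    covered, seen = set(), []
    for step in range(1, max_steps + 1):
        for pid in retrieve(seen):
            seen.append(pid)
            aspect = passage_aspect.get(pid)
            if aspect is not None:
                covered.add(aspect)
        if required_aspects <= covered:
            return step
    return None

PASSAGE_ASPECT = {
    "bg1": "background", "bg2": "background",
    "bg3": "background", "bg4": "background",
    "tab1": "quantitative", "ce1": "counter_example",
}
REQUIRED = {"background", "quantitative", "counter_example"}

redundant = make_retriever(["bg1", "bg2", "bg3", "bg4", "tab1", "ce1"])
diverse = make_retriever(["bg1", "tab1", "ce1", "bg2", "bg3", "bg4"])

print(run_agent(diverse, REQUIRED, PASSAGE_ASPECT))    # 3
print(run_agent(redundant, REQUIRED, PASSAGE_ASPECT))  # None
```

Both retrievers hold the same six passages; only the ordering differs. The redundant one spends four of its five steps on background material and never reaches the counter‑example within budget.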

Practical Implications

  • Better tool‑use for AI assistants – Developers building ChatGPT‑style agents, code assistants, or research assistants can plug RTriever‑4B (or the training pipeline) into their retrieval layer to give the downstream LLM a richer evidence set, which may reduce hallucinations.
  • Reduced retrieval budget – Because the model surfaces complementary evidence early, agents need fewer retrieval cycles, saving API calls and latency in production systems.
  • Fine‑tuning recipe for niche domains – The LoRA‑based approach means teams can adapt a large embedding model to domain‑specific aspect structures (e.g., legal reasoning, medical diagnosis) with modest compute.
  • Benchmark for product teams – BRIGHT‑Pro offers a ready‑to‑use test suite that mirrors real‑world iterative search, enabling more honest QA of retrieval components before release.
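
The "modest compute" claim for LoRA follows from simple arithmetic, sketched below. LoRA freezes a weight matrix W and learns a low‑rank update, so the adapted layer computes (W + (alpha / r) · B·A)·x with B of shape d_out × r and A of shape r × d_in. The 4096 × 4096 layer size and rank 16 are illustrative assumptions, not the configuration used for RTriever‑4B.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha):
    """Compute (W + (alpha / rank) * B A) x without forming the merged matrix."""
    rank = len(A)                      # A has one row per rank dimension
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # trainable low-rank path: B (A x)
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Hypothetical layer: 4096 x 4096 weights adapted with rank 16.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, 16)
print(full, lora, lora / full)  # 16777216 131072 0.0078125

# With B initialised to zero, the adapted layer matches the frozen one.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
print(lora_forward([3.0, 4.0], W, A, [[0.0], [0.0]], alpha=2.0))  # [3.0, 4.0]
print(lora_forward([3.0, 4.0], W, A, [[1.0], [2.0]], alpha=2.0))  # [17.0, 32.0]
```

In this example the adapter trains under 1 % of the layer's parameters, which is why domain adaptation of a multi‑billion‑parameter embedding model can fit on modest hardware.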

Limitations & Future Work

  • Annotation cost – Expanding gold evidence to multiple aspects required expert labor; scaling BRIGHT‑Pro to thousands of queries may be prohibitive.
  • Synthetic bias – RTriever‑Synth relies on LLM‑generated passages, which inherit the model’s biases and may not capture all real‑world nuance.
  • Agentic simulation simplifications – The paper’s agentic protocol assumes a fixed budget and deterministic reasoning steps; actual user‑driven agents may behave more unpredictably.
  • Future directions suggested include (1) crowdsourcing aspect annotations to enlarge the benchmark, (2) incorporating user‑feedback loops for on‑the‑fly aspect discovery, and (3) extending the training pipeline to multimodal evidence (tables, code snippets, images).

Authors

  • Yilun Zhao
  • Jinbiao Wei
  • Tingyu Song
  • Siyue Zhang
  • Chen Zhao
  • Arman Cohan

Paper Information

  • arXiv ID: 2605.04018v1
  • Categories: cs.CL, cs.IR
  • Published: May 5, 2026