[Paper] SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

Published: February 5, 2026 at 01:25 PM EST
4 min read
Source: arXiv

Overview

The paper “SAGE: Benchmarking and Improving Retrieval for Deep Research Agents” investigates whether large‑language‑model (LLM)‑based retrievers can reliably feed scientific literature to autonomous research agents. By building a new benchmark (SAGE) that spans 1,200 realistic research queries across four domains and a 200k‑paper corpus, the authors expose a surprising gap: current deep research agents still stumble on “reasoning‑intensive” retrieval, and classic BM25 outperforms the latest LLM retrievers by a large margin.

Key Contributions

  • SAGE benchmark – a publicly released dataset of 1,200 multi‑step scientific queries plus relevance judgments over a 200k‑paper corpus, covering biology, chemistry, computer science, and physics.
  • Comprehensive evaluation of six state‑of‑the‑art deep research agents, revealing systematic weaknesses in their retrieval pipelines.
  • Empirical comparison of BM25, a traditional sparse retriever, against two strong LLM‑based retrievers (ReasonIR and gte‑Qwen2‑7B‑instruct), showing BM25 is ~30 % more effective on this task.
  • Corpus‑level test‑time scaling framework that uses an LLM to enrich each document with structured metadata and keyword tags, making it easier for off‑the‑shelf retrievers to surface relevant papers.
  • Performance gains of +8 % on short‑form factoid questions and +2 % on open‑ended, multi‑step queries after applying the augmentation pipeline.

Methodology

  1. Benchmark Construction – The authors curated 1,200 queries that mimic real research workflows (e.g., “What are the latest methods for single‑cell RNA‑seq data integration?”). Each query is annotated with a set of gold‑standard papers drawn from expert judgments.
  2. Agent Selection – Six deep research agents (including DR‑Tulu, ReAct‑based agents, and others) were run end‑to‑end on the benchmark. Agents internally decompose the query, generate sub‑queries, and invoke a retriever to fetch documents.
  3. Retriever Comparison – For each agent, three retrieval back‑ends were swapped in: (a) BM25 (Lucene implementation), (b) ReasonIR (LLM‑augmented dense retriever), and (c) gte‑Qwen2‑7B‑instruct (instruction‑tuned LLM). Retrieval quality was measured with nDCG@10 and Recall@100.
  4. Test‑time Scaling – An auxiliary LLM processes the entire corpus once, extracting domain‑specific metadata (e.g., experiment type, dataset name) and a concise keyword list per paper. The enriched index is then queried by the same retrievers without any model fine‑tuning.
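The retriever comparison in step 3 hinges on BM25's term-frequency scoring. Below is a minimal sketch of Okapi BM25 in pure Python — not the Lucene implementation the paper uses, and the `k1`/`b` defaults are common conventions rather than values reported in the paper:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`
    with the Okapi BM25 formula (sketch; Lucene's variant differs slightly)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each query term across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores
```

Because the score is driven purely by exact term overlap, short keyword-heavy sub-queries — exactly what the evaluated agents tend to emit — play directly to BM25's strengths.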

Results & Findings

| Retriever | nDCG@10 (short‑form) | nDCG@10 (open‑ended) |
| --- | --- | --- |
| BM25 | 0.42 | 0.35 |
| ReasonIR | 0.30 | 0.26 |
| gte‑Qwen2‑7B‑instruct | 0.28 | 0.24 |

  • BM25 wins: Across all agents, BM25 consistently outperforms the LLM‑based retrievers by ~30 % in ranking quality.
  • Keyword‑driven sub‑queries: Agents tend to generate short, keyword‑heavy sub‑queries, which play to BM25’s strengths and expose the brittleness of dense/LLM retrievers that rely on semantic matching.
  • Corpus augmentation helps: Adding LLM‑generated metadata and keyword tags lifts BM25’s nDCG@10 to 0.46 (short‑form) and 0.38 (open‑ended), while dense retrievers gain modestly (+2–3 %).
  • Agent variance: Even the best‑performing agent (DR‑Tulu) only reaches 70 % of the oracle upper bound, indicating ample headroom for retrieval‑aware reasoning.
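The nDCG@10 figures above are computed from a ranked list of graded relevance judgments. A minimal implementation with the standard log2 discounting (the paper's exact evaluation script may differ in details such as tie handling):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over relevance grades in rank order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels, k=10):
    """nDCG@k: DCG of the retrieved ranking, normalized by the DCG of
    the ideal (relevance-sorted) ranking. `ranked_rels` holds the
    relevance grade of each retrieved document, in retrieval order."""
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; burying relevant papers below irrelevant ones drags the score toward 0, which is what separates BM25 (0.42) from the dense retrievers (0.28–0.30) in the short-form column.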

Practical Implications

  • Retrieval‑first design: For developers building autonomous research assistants, a robust BM25 pipeline (or hybrid with sparse + dense) remains the safest baseline, especially when the agent’s query generation is keyword‑centric.
  • Metadata enrichment is cheap and effective: Running an LLM once over the corpus to inject structured tags can be integrated into existing indexing pipelines (e.g., Elasticsearch) without retraining retrieval models.
  • Prompt engineering matters: If agents are to benefit from LLM retrievers, they need to generate richer, context‑aware sub‑queries (e.g., “Explain the principle behind X‑ray crystallography in protein structure determination”).
  • Evaluation standards: The SAGE benchmark offers a ready‑made test‑bed for any new retrieval component, encouraging reproducible comparisons across domains.
  • Potential for industry: Companies building literature‑review tools, patent search, or scientific knowledge bases can adopt the augmentation framework to boost recall without incurring heavy compute costs.
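As a concrete illustration of the metadata-enrichment idea, the sketch below appends LLM-generated fields and keyword tags to each document's indexable text before (re)indexing. Here `tag_fn` is a hypothetical stand-in for the paper's auxiliary LLM, and the field names are invented for illustration:

```python
def enrich_document(doc: dict, tag_fn) -> dict:
    """Return a copy of `doc` with an `indexed_text` field that augments
    title + abstract with structured metadata and keyword tags.

    `tag_fn` is any callable mapping an abstract to a dict of metadata
    fields plus a "keywords" list -- a stand-in for the one-time LLM
    pass over the corpus described in the paper.
    """
    meta = tag_fn(doc["abstract"])
    tags = " ".join(meta.get("keywords", []))
    fields = " ".join(f"{k}:{v}" for k, v in meta.items() if k != "keywords")
    enriched = dict(doc)  # leave the original record untouched
    enriched["indexed_text"] = " ".join(
        part for part in (doc["title"], doc["abstract"], fields, tags) if part
    )
    return enriched
```

The enriched `indexed_text` can be fed to an existing sparse index (e.g., an Elasticsearch text field) unchanged; no retriever fine-tuning is required, which is what keeps the approach cheap.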

Limitations & Future Work

  • Domain coverage: SAGE focuses on four scientific fields; performance may differ in humanities or engineering domains.
  • Static corpus: The benchmark uses a fixed snapshot of papers; real‑world systems must handle continuously growing literature and versioning.
  • Agent diversity: Only six agents were evaluated; newer architectures (e.g., Retrieval‑Augmented Generation with LoRA‑fine‑tuned LLMs) could behave differently.
  • LLM scaling: The study employed a 7B‑parameter model; larger instruction‑tuned models might close the gap, but the cost‑benefit trade‑off remains unexplored.
  • User‑centric metrics: The evaluation relies on ranking metrics; future work could incorporate downstream task success (e.g., hypothesis generation accuracy) to better capture real‑world impact.

Authors

  • Tiansheng Hu
  • Yilun Zhao
  • Canyu Zhang
  • Arman Cohan
  • Chen Zhao

Paper Information

  • arXiv ID: 2602.05975v1
  • Categories: cs.IR, cs.CL
  • Published: February 5, 2026