[Paper] SAGE: Benchmarking and Improving Retrieval for Deep Research Agents
Source: arXiv - 2602.05975v1
Overview
The paper “SAGE: Benchmarking and Improving Retrieval for Deep Research Agents” investigates whether large‑language‑model (LLM)‑based retrievers can reliably feed scientific literature to autonomous research agents. By building a new benchmark (SAGE) that spans 1,200 realistic research queries across four domains and a 200 k‑paper corpus, the authors expose a surprising gap: current deep research agents still stumble on “reasoning‑intensive” retrieval, and classic BM25 outperforms the latest LLM retrievers by a large margin.
Key Contributions
- SAGE benchmark – a publicly released dataset of 1,200 multi‑step scientific queries plus relevance judgments over a 200 k‑paper corpus, covering biology, chemistry, computer science, and physics.
- Comprehensive evaluation of six state‑of‑the‑art deep research agents, revealing systematic weaknesses in their retrieval pipelines.
- Empirical comparison of BM25, a traditional sparse retriever, against two strong LLM‑based retrievers (ReasonIR and gte‑Qwen2‑7B‑instruct), showing BM25 is ~30 % more effective on this task.
- Corpus‑level test‑time scaling framework that uses an LLM to enrich each document with structured metadata and keyword tags, making it easier for off‑the‑shelf retrievers to surface relevant papers.
- Performance gains of +8 % on short‑form factoid questions and +2 % on open‑ended, multi‑step queries after applying the augmentation pipeline.
Methodology
- Benchmark Construction – The authors curated 1,200 queries that mimic real research workflows (e.g., “What are the latest methods for single‑cell RNA‑seq data integration?”). Each query is annotated with a set of gold‑standard papers drawn from expert judgments.
- Agent Selection – Six deep research agents (including DR‑Tulu, ReAct‑based agents, and others) were run end‑to‑end on the benchmark. Agents internally decompose the query, generate sub‑queries, and invoke a retriever to fetch documents.
- Retriever Comparison – For each agent, three retrieval back‑ends were swapped in: (a) BM25 (Lucene implementation), (b) ReasonIR (LLM‑augmented dense retriever), and (c) gte‑Qwen2‑7B‑instruct (instruction‑tuned LLM embedding model). Retrieval quality was measured with nDCG@10 and Recall@100.
- Test‑time Scaling – An auxiliary LLM processes the entire corpus once, extracting domain‑specific metadata (e.g., experiment type, dataset name) and a concise keyword list per paper. The enriched index is then queried by the same retrievers without any model fine‑tuning.
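The enrichment pass described above can be sketched as a one‑time loop over the corpus. This is a minimal illustration, not the paper's pipeline: the `extract_metadata` function is a hypothetical stand‑in for the auxiliary LLM call (the paper does not publish its prompt or interface), and the keyword heuristic here is a placeholder.

```python
import json

def extract_metadata(abstract: str) -> dict:
    # Hypothetical stand-in for the auxiliary LLM: a real pipeline would
    # prompt a model here. This stub only illustrates the output shape
    # (structured fields plus a keyword list per paper).
    return {
        "experiment_type": "unknown",
        "dataset": "unknown",
        "keywords": sorted({w.lower() for w in abstract.split() if len(w) > 6}),
    }

def enrich_corpus(papers: list[dict]) -> list[dict]:
    """One-time, test-time enrichment pass over the whole corpus."""
    enriched = []
    for paper in papers:
        meta = extract_metadata(paper["abstract"])
        # Append the tags to the indexed text so off-the-shelf retrievers
        # (e.g., BM25) can match on them without any fine-tuning.
        indexed_text = paper["abstract"] + " " + " ".join(meta["keywords"])
        enriched.append({**paper, "metadata": meta, "indexed_text": indexed_text})
    return enriched

corpus = [{"id": "p1", "abstract": "Single-cell RNA-seq integration methods compared"}]
print(json.dumps(enrich_corpus(corpus)[0]["metadata"], indent=2))
```

The key design point is that enrichment changes only the index, not the retriever, which is why the paper calls it corpus‑level test‑time scaling.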
Results & Findings
| Retriever | nDCG@10 (short‑form) | nDCG@10 (open‑ended) |
|---|---|---|
| BM25 | 0.42 | 0.35 |
| ReasonIR | 0.30 | 0.26 |
| gte‑Qwen2‑7B‑instruct | 0.28 | 0.24 |
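For reference, the nDCG@10 metric used in the table above can be computed with the standard formula (DCG discounted by log₂ of rank, normalized by the ideal ordering); this is a textbook sketch, not the paper's evaluation code.

```python
import math

def dcg_at_k(relevances, k=10):
    # DCG@k = sum over the top-k ranks of rel_i / log2(i + 1), 1-indexed
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ordering
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfectly ordered ranking scores 1.0; placing a relevant paper lower in the list decays its contribution logarithmically.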
- BM25 wins: Across all agents, BM25 consistently outperforms the LLM‑based retrievers by ~30 % in ranking quality.
- Keyword‑driven sub‑queries: Agents tend to generate short, keyword‑heavy sub‑queries, which play to BM25’s strengths and expose the brittleness of dense/LLM retrievers that rely on semantic matching.
- Corpus augmentation helps: Adding LLM‑generated metadata and keyword tags lifts BM25’s nDCG@10 to 0.46 (short‑form) and 0.38 (open‑ended), while dense retrievers gain modestly (+2–3 %).
- Agent variance: Even the best‑performing agent (DR‑Tulu) only reaches 70 % of the oracle upper bound, indicating ample headroom for retrieval‑aware reasoning.
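The keyword‑matching behavior that favors BM25 is visible in its scoring function. Below is the textbook Okapi BM25 formula in pure Python (the paper used Lucene's implementation, whose defaults may differ slightly): a document scores highly exactly when it contains the query's rare terms, which suits short, keyword‑heavy sub‑queries.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 over a tokenized corpus (standard formula; a sketch,
    not the Lucene configuration used in the paper)."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                  # term frequency in this document
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Note that a document sharing no terms with the query scores exactly zero; a dense retriever, by contrast, always returns some semantic similarity, which can surface plausible-but-wrong neighbors for terse sub‑queries.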
Practical Implications
- Retrieval‑first design: For developers building autonomous research assistants, a robust BM25 pipeline (or hybrid with sparse + dense) remains the safest baseline, especially when the agent’s query generation is keyword‑centric.
- Metadata enrichment is cheap and effective: Running an LLM once over the corpus to inject structured tags can be integrated into existing indexing pipelines (e.g., Elasticsearch) without retraining retrieval models.
- Prompt engineering matters: If agents are to benefit from LLM retrievers, they need to generate richer, context‑aware sub‑queries (e.g., “Explain the principle behind X‑ray crystallography in protein structure determination”).
- Evaluation standards: The SAGE benchmark offers a ready‑made test‑bed for any new retrieval component, encouraging reproducible comparisons across domains.
- Potential for industry: Companies building literature‑review tools, patent search, or scientific knowledge bases can adopt the augmentation framework to boost recall without incurring heavy compute costs.
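For the hybrid sparse + dense setup suggested above, one common way to combine the two ranked lists is reciprocal rank fusion (RRF). This is a general technique, not something the paper evaluates; the constant `k=60` is the conventional default from the RRF literature.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum of 1 / (k + rank(d))
    over every list in which d appears (ranks are 1-indexed)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only rank positions, so it sidesteps the problem that BM25 scores and dense cosine similarities live on incomparable scales.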
Limitations & Future Work
- Domain coverage: SAGE focuses on four scientific fields; performance may differ in humanities or engineering domains.
- Static corpus: The benchmark uses a fixed snapshot of papers; real‑world systems must handle continuously growing literature and versioning.
- Agent diversity: Only six agents were evaluated; newer architectures (e.g., Retrieval‑Augmented Generation with LoRA‑fine‑tuned LLMs) could behave differently.
- LLM scaling: The study employed a 7B‑parameter model; larger instruction‑tuned models might close the gap, but the cost‑benefit trade‑off remains unexplored.
- User‑centric metrics: The evaluation relies on ranking metrics; future work could incorporate downstream task success (e.g., hypothesis generation accuracy) to better capture real‑world impact.
Authors
- Tiansheng Hu
- Yilun Zhao
- Canyu Zhang
- Arman Cohan
- Chen Zhao
Paper Information
- arXiv ID: 2602.05975v1
- Categories: cs.IR, cs.CL
- Published: February 5, 2026