[Paper] SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

Published: February 5, 2026 at 01:25 PM EST
4 min read
Source: arXiv

Overview

The paper “SAGE: Benchmarking and Improving Retrieval for Deep Research Agents” investigates whether large‑language‑model (LLM)‑based retrievers can reliably feed scientific literature to autonomous research agents. By building a new benchmark (SAGE) that spans 1,200 realistic research queries across four domains and a 200k‑paper corpus, the authors expose a surprising gap: current deep research agents still stumble on “reasoning‑intensive” retrieval, and classic BM25 outperforms the latest LLM retrievers by a large margin.

Key Contributions

  • SAGE benchmark – a publicly released dataset of 1,200 multi‑step scientific queries plus relevance judgments over a 200k‑paper corpus, covering biology, chemistry, computer science, and physics.
  • Comprehensive evaluation of six state‑of‑the‑art deep research agents, revealing systematic weaknesses in their retrieval pipelines.
  • Empirical comparison of BM25, a traditional sparse retriever, against two strong LLM‑based retrievers (ReasonIR and gte‑Qwen2‑7B‑instruct), showing BM25 is ~30 % more effective on this task.
  • Corpus‑level test‑time scaling framework that uses an LLM to enrich each document with structured metadata and keyword tags, making it easier for off‑the‑shelf retrievers to surface relevant papers.
  • Performance gains of +8 % on short‑form factoid questions and +2 % on open‑ended, multi‑step queries after applying the augmentation pipeline.

Methodology

  1. Benchmark Construction – The authors curated 1,200 queries that mimic real research workflows (e.g., “What are the latest methods for single‑cell RNA‑seq data integration?”). Each query is annotated with a set of gold‑standard papers drawn from expert judgments.
  2. Agent Selection – Six deep research agents (including DR‑Tulu, ReAct‑based agents, and others) were run end‑to‑end on the benchmark. Agents internally decompose the query, generate sub‑queries, and invoke a retriever to fetch documents.
  3. Retriever Comparison – For each agent, three retrieval back‑ends were swapped in: (a) BM25 (Lucene implementation), (b) ReasonIR (LLM‑augmented dense retriever), and (c) gte‑Qwen2‑7B‑instruct (instruction‑tuned LLM). Retrieval quality was measured with nDCG@10 and Recall@100.
  4. Test‑time Scaling – An auxiliary LLM processes the entire corpus once, extracting domain‑specific metadata (e.g., experiment type, dataset name) and a concise keyword list per paper. The enriched index is then queried by the same retrievers without any model fine‑tuning.
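The retriever comparison in step 3 hinges on BM25's term-frequency scoring. Below is a minimal sketch of Okapi BM25 in pure Python — not the Lucene implementation the paper uses, and the `k1`/`b` defaults are common conventions rather than values reported in the paper:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`
    with the Okapi BM25 formula (sketch; Lucene's variant differs slightly)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each query term across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return scores
```

Because the score is driven purely by exact term overlap, short keyword-heavy sub-queries — exactly what the evaluated agents tend to emit — play directly to BM25's strengths.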

Results & Findings

| Retriever | nDCG@10 (short‑form) | nDCG@10 (open‑ended) |
| --- | --- | --- |
| BM25 | 0.42 | 0.35 |
| ReasonIR | 0.30 | 0.26 |
| gte‑Qwen2‑7B‑instruct | 0.28 | 0.24 |

  • BM25 wins: Across all agents, BM25 consistently outperforms the LLM‑based retrievers by ~30 % in ranking quality.
  • Keyword‑driven sub‑queries: Agents tend to generate short, keyword‑heavy sub‑queries, which play to BM25’s strengths and expose the brittleness of dense/LLM retrievers that rely on semantic matching.
  • Corpus augmentation helps: Adding LLM‑generated metadata and keyword tags lifts BM25’s nDCG@10 to 0.46 (short‑form) and 0.38 (open‑ended), while dense retrievers gain modestly (+2–3 %).
  • Agent variance: Even the best‑performing agent (DR‑Tulu) only reaches 70 % of the oracle upper bound, indicating ample headroom for retrieval‑aware reasoning.
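The nDCG@10 figures above are computed from a ranked list of graded relevance judgments. A minimal implementation with the standard log2 discounting (the paper's exact evaluation script may differ in details such as tie handling):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over relevance grades in rank order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels, k=10):
    """nDCG@k: DCG of the retrieved ranking, normalized by the DCG of
    the ideal (relevance-sorted) ranking. `ranked_rels` holds the
    relevance grade of each retrieved document, in retrieval order."""
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; burying relevant papers below irrelevant ones drags the score toward 0, which is what separates BM25 (0.42) from the dense retrievers (0.28–0.30) in the short-form column.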

Practical Implications

  • Retrieval‑first design: For developers building autonomous research assistants, a robust BM25 pipeline (or hybrid with sparse + dense) remains the safest baseline, especially when the agent’s query generation is keyword‑centric.
  • Metadata enrichment is cheap and effective: Running an LLM once over the corpus to inject structured tags can be integrated into existing indexing pipelines (e.g., Elasticsearch) without retraining retrieval models.
  • Prompt engineering matters: If agents are to benefit from LLM retrievers, they need to generate richer, context‑aware sub‑queries (e.g., “Explain the principle behind X‑ray crystallography in protein structure determination”).
  • Evaluation standards: The SAGE benchmark offers a ready‑made test‑bed for any new retrieval component, encouraging reproducible comparisons across domains.
  • Potential for industry: Companies building literature‑review tools, patent search, or scientific knowledge bases can adopt the augmentation framework to boost recall without incurring heavy compute costs.
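As a concrete illustration of the metadata-enrichment idea, the sketch below appends LLM-generated fields and keyword tags to each document's indexable text before (re)indexing. Here `tag_fn` is a hypothetical stand-in for the paper's auxiliary LLM, and the field names are invented for illustration:

```python
def enrich_document(doc: dict, tag_fn) -> dict:
    """Return a copy of `doc` with an `indexed_text` field that augments
    title + abstract with structured metadata and keyword tags.

    `tag_fn` is any callable mapping an abstract to a dict of metadata
    fields plus a "keywords" list -- a stand-in for the one-time LLM
    pass over the corpus described in the paper.
    """
    meta = tag_fn(doc["abstract"])
    tags = " ".join(meta.get("keywords", []))
    fields = " ".join(f"{k}:{v}" for k, v in meta.items() if k != "keywords")
    enriched = dict(doc)  # leave the original record untouched
    enriched["indexed_text"] = " ".join(
        part for part in (doc["title"], doc["abstract"], fields, tags) if part
    )
    return enriched
```

The enriched `indexed_text` can be fed to an existing sparse index (e.g., an Elasticsearch text field) unchanged; no retriever fine-tuning is required, which is what keeps the approach cheap.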

Limitations & Future Work

  • Domain coverage: SAGE focuses on four scientific fields; performance may differ in humanities or engineering domains.
  • Static corpus: The benchmark uses a fixed snapshot of papers; real‑world systems must handle continuously growing literature and versioning.
  • Agent diversity: Only six agents were evaluated; newer architectures (e.g., Retrieval‑Augmented Generation with LoRA‑fine‑tuned LLMs) could behave differently.
  • LLM scaling: The study employed a 7B‑parameter model; larger instruction‑tuned models might close the gap, but the cost‑benefit trade‑off remains unexplored.
  • User‑centric metrics: The evaluation relies on ranking metrics; future work could incorporate downstream task success (e.g., hypothesis generation accuracy) to better capture real‑world impact.

Authors

  • Tiansheng Hu
  • Yilun Zhao
  • Canyu Zhang
  • Arman Cohan
  • Chen Zhao

Paper Information

  • arXiv ID: 2602.05975v1
  • Categories: cs.IR, cs.CL
  • Published: February 5, 2026