[Paper] Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Published: June 8, 2026 at 01:08 PM EDT

Source: arXiv - 2606.09748v1

Overview

Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately 8-15 points and yielding a roughly 35-40% incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to 24% of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.
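The findings above are stated in terms of incorporation and regression on rubric criteria between turns. The sketch below shows one plausible way to compute such rates. It is not the authors' implementation: the function name `turn_transition_rates`, the dictionary layout, and the exact definitions are assumptions made for illustration. Incorporation is taken here as the share of previously unsatisfied criteria that become satisfied, regression as the share of previously satisfied criteria that are lost, and the score as an unweighted percentage of satisfied criteria. The paper may define or weight these differently.

```python
from typing import Dict


def turn_transition_rates(prev: Dict[str, bool], curr: Dict[str, bool]) -> Dict[str, float]:
    """Compare rubric satisfaction between two consecutive report versions.

    Hypothetical helper: `prev` and `curr` map each rubric criterion id to
    whether the report at that turn satisfies it.
    """
    if not prev:
        raise ValueError("rubric must contain at least one criterion")
    if prev.keys() != curr.keys():
        raise ValueError("both turns must be scored on the same rubric criteria")

    unsatisfied_before = [c for c, ok in prev.items() if not ok]
    satisfied_before = [c for c, ok in prev.items() if ok]

    # Criteria that were missing before and are satisfied now.
    incorporated = sum(curr[c] for c in unsatisfied_before)
    # Criteria that were satisfied before and are missing now.
    regressed = sum(not curr[c] for c in satisfied_before)

    return {
        "incorporation_rate": incorporated / len(unsatisfied_before) if unsatisfied_before else 0.0,
        "regression_rate": regressed / len(satisfied_before) if satisfied_before else 0.0,
        # Assumed unweighted score: percentage of criteria satisfied.
        "score_change": 100.0 * (sum(curr.values()) - sum(prev.values())) / len(prev),
    }


# Made-up example: one criterion gained, one lost, so the net score is unchanged.
prev_turn = {"c1": True, "c2": True, "c3": True, "c4": True,
             "c5": False, "c6": False, "c7": False, "c8": False}
curr_turn = {"c1": True, "c2": True, "c3": True, "c4": False,
             "c5": True, "c6": False, "c7": False, "c8": False}

rates = turn_transition_rates(prev_turn, curr_turn)
print(rates)
```

In this made-up example the incorporation and regression rates are equal, so the score does not move, which mirrors the pattern the abstract reports for self-reflection: gains on some criteria are cancelled by losses on others.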

Key Contributions

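As summarized in the abstract, this paper contributes:

A multi-turn evaluation of deep research agents under two feedback settings: self-reflection and process-level feedback.
Research Gap Inference (RGI), a method that infers research-process gaps from patterns of satisfied and unsatisfied rubric criteria.
An analysis of how agents incorporate and regress on rubric criteria across turns in each setting.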

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

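For the DRA architectures evaluated, a single round of targeted process-level feedback is worthwhile, but further rounds are not yet reliable, because rewriting the full report can undo criteria that were already satisfied.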

Authors

Rishabh Sabharwal
Hongru Wang
Amos Storkey
Jeff Z. Pan

Paper Information

arXiv ID: 2606.09748v1
Categories: cs.AI, cs.CL, cs.LG
Published: June 8, 2026
