[Paper] A Benchmark for Deep Information Synthesis
Source: arXiv - 2602.21143v1
Overview
The paper introduces DEEPSYNTH, a new benchmark that pushes large‑language‑model (LLM) agents to tackle realistic, multi‑step problems requiring data gathering, synthesis, and structured reasoning. By covering seven domains and 67 countries, the authors show that today’s best models still struggle to turn raw information into reliable insights.
Key Contributions
- DEEPSYNTH benchmark: 120 carefully crafted tasks that mimic real‑world research or analysis workflows (e.g., policy comparison, market trend extraction).
- Multi‑stage data‑collection pipeline: annotators source official data, formulate hypotheses, manually analyze results, and produce verifiable answer keys.
- Comprehensive evaluation: 11 state‑of‑the‑art LLMs and research agents are tested; the best system reaches only 8.97 F1 under the standard metric and 17.5/100 under an LLM‑judge scoring scheme.
- Error analysis: systematic study of hallucinations, information‑overload failures, and reasoning gaps that dominate current agent performance.
- Open resources: the benchmark, data‑collection scripts, and evaluation code are released publicly to foster reproducible research.
Methodology
- Task Design – Domain experts pick a real‑world question (e.g., “How did renewable‑energy subsidies change across EU nations in 2022?”).
- Data Gathering – Annotators retrieve official datasets, reports, or APIs from multiple sources (government portals, statistical bureaus, etc.).
- Hypothesis & Manual Analysis – They draft a plausible answer, run the necessary calculations or visual analyses, and verify the result.
- Task Formalization – The final task includes a clear prompt, required data links, and a ground‑truth answer expressed in a structured format (tables, JSON, or short prose).
- Agent Evaluation – Each LLM agent receives the prompt and data URLs, runs its internal tool‑use pipeline (web browsing, code execution, etc.), and returns an answer.
- Scoring – Answers are compared against the ground truth using exact‑match F1 for structured fields and an LLM‑judge that rates overall correctness and reasoning quality.
The pipeline is deliberately “human‑in‑the‑loop” to ensure that tasks are neither trivial fact look‑ups nor impossible open‑ended research questions.
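To make the exact‑match scoring concrete, here is a minimal sketch of field‑level F1 over structured answers. It assumes answers are flat key–value records; the `field_f1` helper, the country codes, and the subsidy figures are illustrative assumptions, not the paper's actual evaluation code or data.

```python
def field_f1(pred: dict, gold: dict) -> float:
    """Exact-match F1 over the fields of a structured answer.

    A predicted field counts as correct only if its value matches
    the ground-truth value exactly.
    """
    if not pred or not gold:
        return 0.0
    correct = sum(1 for k, v in pred.items() if gold.get(k) == v)
    precision = correct / len(pred)
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical task: 2022 renewable-energy subsidies (EUR millions) per country.
gold = {"DE": 17200, "FR": 6900, "ES": 3100}
pred = {"DE": 17200, "FR": 6900, "IT": 4500}  # one extra country, one missing
print(round(field_f1(pred, gold), 3))  # 2 correct of 3 predicted / 3 gold -> 0.667
```

An LLM‑judge score would complement this by rating free‑form reasoning quality, which exact match cannot capture.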
Results & Findings
- The best‑performing model (GPT‑4‑based research agent) achieved 8.97 % F1 and 17.5 / 100 on the LLM‑judge metric—far below human baseline (>90 %).
- Hallucination rate: >60 % of generated answers contained fabricated figures or citations.
- Information overload: Agents often stopped after the first few sources, missing critical data points that were deeper in the provided list.
- Reasoning failures: Even when the correct data was retrieved, models struggled to combine it into coherent, multi‑step conclusions (e.g., calculating year‑over‑year growth across countries).
- Simple retrieval‑oriented baselines (e.g., “answer with the first table found”) performed comparably to the most advanced agents, underscoring that current systems are still retrieval‑centric.
Practical Implications
- Tool‑building: Developers of autonomous agents (e.g., AI‑assistants for market research, compliance checks, or scientific literature reviews) now have a concrete yardstick to measure true synthesis capability.
- Prompt engineering: The benchmark highlights the need for prompts that explicitly request source citation, step‑by‑step reasoning, and verification loops.
- Safety & reliability: High hallucination rates on DEEPSYNTH suggest that deploying LLM agents in high‑stakes domains (finance, policy, healthcare) requires additional guardrails such as external fact‑checkers or human‑in‑the‑loop verification.
- Product roadmaps: Companies can prioritize features like “structured data extraction APIs,” “long‑context memory,” and “iterative tool use” to close the gap shown by DEEPSYNTH.
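The prompt‑engineering point above can be sketched as a template that bakes in source citation, step‑by‑step reasoning, and a verification pass. The `build_synthesis_prompt` helper, its wording, and the example URLs are illustrative assumptions, not a template from the paper.

```python
def build_synthesis_prompt(question: str, source_urls: list) -> str:
    """Assemble a prompt that asks for cited, stepwise, self-verified answers."""
    sources = "\n".join(f"- {u}" for u in source_urls)
    return (
        f"Question: {question}\n\n"
        f"Sources (consult ALL of them, not just the first):\n{sources}\n\n"
        "Instructions:\n"
        "1. Extract the relevant figures from each source, citing the source for each figure.\n"
        "2. Show every calculation step by step before stating a conclusion.\n"
        "3. Re-check each number against its cited source; flag anything you could not verify.\n"
    )


prompt = build_synthesis_prompt(
    "How did renewable-energy subsidies change across EU nations in 2022?",
    ["https://example.org/eurostat-energy", "https://example.org/ec-state-aid"],
)
print(prompt)
```

Templates like this target the failure modes the paper reports: citation requirements push back on hallucination, the "consult ALL" framing addresses early stopping, and the verification step forces a second pass over retrieved figures.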
Limitations & Future Work
- Domain coverage: the seven domains are diverse but still omit areas such as software debugging and creative design, which may pose different synthesis challenges.
- Human annotation bias: The ground‑truth answers depend on annotators’ manual analyses; subtle alternative interpretations could be penalized as errors.
- Scalability: The benchmark’s size (120 tasks) is modest compared to large‑scale QA sets, limiting statistical power for fine‑grained model comparisons.
- Future directions proposed include expanding to thousands of tasks, automating parts of the data‑collection pipeline, and integrating dynamic, time‑sensitive sources (e.g., live APIs) to test agents’ ability to adapt to changing information.
Authors
- Debjit Paul
- Daniel Murphy
- Milan Gritta
- Ronald Cardenas
- Victor Prokhorov
- Lena Sophia Bolliger
- Aysim Toker
- Roy Miles
- Andreea-Maria Oncescu
- Jasivan Alex Sivakumar
- Philipp Borchert
- Ismail Elezi
- Meiru Zhang
- Ka Yiu Lee
- Guchun Zhang
- Jun Wang
- Gerasimos Lampouras
Paper Information
- arXiv ID: 2602.21143v1
- Categories: cs.AI, cs.CL, cs.IR, cs.LG
- Published: February 24, 2026