[Paper] A Benchmark for Deep Information Synthesis
Source: arXiv - 2602.21143v1
Overview
The paper introduces DEEPSYNTH, a new benchmark that pushes large‑language‑model (LLM) agents to tackle realistic, multi‑step problems requiring data gathering, synthesis, and structured reasoning. By covering seven domains and 67 countries, the authors show that today’s best models still struggle to turn raw information into reliable insights.
Key Contributions
- DEEPSYNTH benchmark: 120 carefully crafted tasks that mimic real‑world research or analysis workflows (e.g., policy comparison, market trend extraction).
- Multi‑stage data‑collection pipeline: annotators source official data, formulate hypotheses, manually analyze results, and produce verifiable answer keys.
- Comprehensive evaluation: 11 state‑of‑the‑art LLMs and research agents are tested; the best system reaches only 8.97 F1 under the standard metric and 17.5/100 under an LLM‑judge scoring scheme.
- Error analysis: systematic study of hallucinations, information‑overload failures, and reasoning gaps that dominate current agent performance.
- Open resources: the benchmark, data‑collection scripts, and evaluation code are released publicly to foster reproducible research.
Methodology
- Task Design – Domain experts pick a real‑world question (e.g., “How did renewable‑energy subsidies change across EU nations in 2022?”).
- Data Gathering – Annotators retrieve official datasets, reports, or APIs from multiple sources (government portals, statistical bureaus, etc.).
- Hypothesis & Manual Analysis – They draft a plausible answer, run the necessary calculations or visual analyses, and verify the result.
- Task Formalization – The final task includes a clear prompt, required data links, and a ground‑truth answer expressed in a structured format (tables, JSON, or short prose).
- Agent Evaluation – Each LLM agent receives the prompt and data URLs, runs its internal tool‑use pipeline (web browsing, code execution, etc.), and returns an answer.
- Scoring – Answers are compared against the ground truth using exact‑match F1 for structured fields and an LLM‑judge that rates overall correctness and reasoning quality.
The pipeline is deliberately “human‑in‑the‑loop” to ensure that tasks are neither trivial fact look‑ups nor impossible open‑ended research questions.
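To make the exact‑match scoring concrete, here is a minimal sketch of field‑level F1 over structured answers. It assumes answers are flat key–value records; the `field_f1` helper, the country codes, and the subsidy figures are illustrative assumptions, not the paper's actual evaluation code or data.

```python
def field_f1(pred: dict, gold: dict) -> float:
    """Exact-match F1 over the fields of a structured answer.

    A predicted field counts as correct only if its value matches
    the ground-truth value exactly.
    """
    if not pred or not gold:
        return 0.0
    correct = sum(1 for k, v in pred.items() if gold.get(k) == v)
    precision = correct / len(pred)
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical task: 2022 renewable-energy subsidies (EUR millions) per country.
gold = {"DE": 17200, "FR": 6900, "ES": 3100}
pred = {"DE": 17200, "FR": 6900, "IT": 4500}  # one extra country, one missing
print(round(field_f1(pred, gold), 3))  # 2 correct of 3 predicted / 3 gold -> 0.667
```

An LLM‑judge score would complement this by rating free‑form reasoning quality, which exact match cannot capture.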
Results & Findings
- The best‑performing model (GPT‑4‑based research agent) achieved 8.97 % F1 and 17.5 / 100 on the LLM‑judge metric—far below human baseline (>90 %).
- Hallucination rate: >60 % of generated answers contained fabricated figures or citations.
- Information overload: Agents often stopped after the first few sources, missing critical data points that were deeper in the provided list.
- Reasoning failures: Even when the correct data was retrieved, models struggled to combine it into coherent, multi‑step conclusions (e.g., calculating year‑over‑year growth across countries).
- Simple retrieval‑oriented baselines (e.g., “answer with the first table found”) performed comparably to the most advanced agents, underscoring that current systems are still retrieval‑centric.
Practical Implications
- Tool‑building: Developers of autonomous agents (e.g., AI‑assistants for market research, compliance checks, or scientific literature reviews) now have a concrete yardstick to measure true synthesis capability.
- Prompt engineering: The benchmark highlights the need for prompts that explicitly request source citation, step‑by‑step reasoning, and verification loops.
- Safety & reliability: High hallucination rates on DEEPSYNTH suggest that deploying LLM agents in high‑stakes domains (finance, policy, healthcare) requires additional guardrails such as external fact‑checkers or human‑in‑the‑loop verification.
- Product roadmaps: Companies can prioritize features like “structured data extraction APIs,” “long‑context memory,” and “iterative tool use” to close the gap shown by DEEPSYNTH.
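The prompt‑engineering point above can be sketched as a template that bakes in source citation, step‑by‑step reasoning, and a verification pass. The `build_synthesis_prompt` helper, its wording, and the example URLs are illustrative assumptions, not a template from the paper.

```python
def build_synthesis_prompt(question: str, source_urls: list) -> str:
    """Assemble a prompt that asks for cited, stepwise, self-verified answers."""
    sources = "\n".join(f"- {u}" for u in source_urls)
    return (
        f"Question: {question}\n\n"
        f"Sources (consult ALL of them, not just the first):\n{sources}\n\n"
        "Instructions:\n"
        "1. Extract the relevant figures from each source, citing the source for each figure.\n"
        "2. Show every calculation step by step before stating a conclusion.\n"
        "3. Re-check each number against its cited source; flag anything you could not verify.\n"
    )


prompt = build_synthesis_prompt(
    "How did renewable-energy subsidies change across EU nations in 2022?",
    ["https://example.org/eurostat-energy", "https://example.org/ec-state-aid"],
)
print(prompt)
```

Templates like this target the failure modes the paper reports: citation requirements push back on hallucination, the "consult ALL" framing addresses early stopping, and the verification step forces a second pass over retrieved figures.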
Limitations & Future Work
- Domain coverage: the seven domains are diverse but still omit areas such as software debugging and creative design, which may pose different synthesis challenges.
- Human annotation bias: The ground‑truth answers depend on annotators’ manual analyses; subtle alternative interpretations could be penalized as errors.
- Scalability: The benchmark’s size (120 tasks) is modest compared to large‑scale QA sets, limiting statistical power for fine‑grained model comparisons.
- Future directions proposed include expanding to thousands of tasks, automating parts of the data‑collection pipeline, and integrating dynamic, time‑sensitive sources (e.g., live APIs) to test agents’ ability to adapt to changing information.
Authors
- Debjit Paul
- Daniel Murphy
- Milan Gritta
- Ronald Cardenas
- Victor Prokhorov
- Lena Sophia Bolliger
- Aysim Toker
- Roy Miles
- Andreea-Maria Oncescu
- Jasivan Alex Sivakumar
- Philipp Borchert
- Ismail Elezi
- Meiru Zhang
- Ka Yiu Lee
- Guchun Zhang
- Jun Wang
- Gerasimos Lampouras
Paper Information
- arXiv ID: 2602.21143v1
- Categories: cs.AI, cs.CL, cs.IR, cs.LG
- Published: February 24, 2026