[Paper] Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
Source: arXiv - 2605.06635v1
Overview
The paper presents the first systematic way to measure how well large language models (LLMs) actually cite their sources when they act as “deep research agents” that generate long, Markdown‑style reports. By parsing inline citations, pulling the referenced webpages, and checking them for accessibility, relevance, and factual consistency, the authors expose a hidden reliability gap: even top‑tier models often produce citations that look good on the surface but contain inaccurate facts.
Key Contributions
- A reproducible AST‑based citation parser that extracts Markdown‑style references from LLM‑generated documents at scale.
- A three‑dimensional evaluation framework (Link Works, Relevant Content, Fact Check) that closes the loop by retrieving the cited source and letting humans or LLM judges assess it.
- Benchmark results for 14 closed‑source and open‑source LLMs, showing high link‑validity (>94%) and relevance (>80%) but much lower factual accuracy (39‑77%).
- Ablation study on tool‑call depth, revealing that more retrieval calls (2 → 150) actually decrease factual correctness by ~42% for frontier models.
- Open‑source evaluation infrastructure (parser, rubrics, calibration scripts) that the community can reuse for future citation‑quality research.
Methodology
- Report Generation – Each LLM is prompted to write a research‑style report in Markdown, inserting inline citations ([1], [2], etc.) that include URLs.
- AST Parsing – A lightweight abstract‑syntax‑tree (AST) parser walks the Markdown document, extracts every citation block, and normalizes the URLs.
- Source Retrieval – The parser automatically fetches each URL (handling redirects, HTTP errors, and paywalls where possible).
- Evaluation Dimensions
  - Link Works – Checks if the URL resolves to a reachable page (HTTP status 200).
  - Relevant Content – Uses semantic similarity (e.g., embeddings) between the cited passage and the surrounding report text to gauge topical alignment.
  - Fact Check – Compares factual statements in the report against the retrieved source using a rubric‑driven LLM‑as‑a‑judge, calibrated against a small human‑annotated set.
- Scoring & Aggregation – Scores are averaged per model and per dimension, enabling direct comparison across the 14 systems.
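The parsing and URL-normalization steps above can be illustrated with a minimal sketch. It is not the authors' AST parser: it swaps in two regular expressions as a simplification, and the function names (`extract_citations`, `normalize_url`), the supported citation shapes, and the normalization rules (lowercase host, dropped fragment) are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: citations appear either as inline links, e.g. [1](https://example.com/a),
# or as reference definitions on their own line, e.g. "[1]: https://example.com/a".
INLINE_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)", re.IGNORECASE)
REF_DEF = re.compile(r"^\s*\[([^\]]+)\]:\s*(https?://\S+)", re.IGNORECASE | re.MULTILINE)


def normalize_url(url: str) -> str:
    """Lowercase the scheme and host and drop the fragment (illustrative rules only)."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def extract_citations(markdown: str) -> dict[str, str]:
    """Map each citation label to its normalized URL; the first occurrence wins."""
    citations: dict[str, str] = {}
    for pattern in (INLINE_LINK, REF_DEF):
        for label, url in pattern.findall(markdown):
            citations.setdefault(label, normalize_url(url))
    return citations


report = """Solar capacity grew last year [1](https://Example.com/solar#growth).
Costs also fell [2].

[2]: HTTPS://example.org/costs?year=2024
"""

for label, url in extract_citations(report).items():
    print(label, url)
```

A real Markdown AST would also handle nested brackets, code spans, and escaped characters, which these regular expressions ignore.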
The entire pipeline is open‑source, containerized, and can be run on a modest GPU‑enabled workstation, making it practical for both academia and industry teams.
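As a rough illustration of the Scoring & Aggregation step, the sketch below averages per-citation verdicts into one score per model and dimension. The record layout, the dimension keys, and the `aggregate` function are hypothetical, and the sample verdicts are made up for the example; they are not results from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical keys for the three evaluation dimensions.
DIMENSIONS = ("link_works", "relevant_content", "fact_check")


def aggregate(records: list[dict]) -> dict[str, dict[str, float]]:
    """Average boolean per-citation verdicts into a pass rate per model and dimension."""
    by_model: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        by_model[record["model"]].append(record)
    return {
        model: {
            dim: mean(1.0 if row[dim] else 0.0 for row in rows)
            for dim in DIMENSIONS
        }
        for model, rows in by_model.items()
    }


# Made-up verdicts for two placeholder models.
records = [
    {"model": "model-a", "link_works": True, "relevant_content": True, "fact_check": True},
    {"model": "model-a", "link_works": True, "relevant_content": True, "fact_check": False},
    {"model": "model-a", "link_works": True, "relevant_content": True, "fact_check": False},
    {"model": "model-a", "link_works": True, "relevant_content": False, "fact_check": True},
    {"model": "model-b", "link_works": True, "relevant_content": True, "fact_check": False},
    {"model": "model-b", "link_works": False, "relevant_content": True, "fact_check": False},
]

for model, scores in aggregate(records).items():
    print(model, scores)
```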
Results & Findings
| Model Category | Link Works | Relevant Content | Fact Check |
|---|---|---|---|
| Frontier closed‑source (e.g., GPT‑4, Claude) | 94‑98% | 81‑86% | 39‑57% |
| Strong open‑source (e.g., Llama‑2‑70B) | 92‑95% | 78‑82% | 45‑63% |
| Smaller open‑source (≤13B) | 85‑90% | 70‑75% | 39‑48% |
- Citation surface quality is high: most models reliably produce reachable URLs and generally cite on‑topic material.
- Factual reliability lags: even the best models get the facts right in only about half of the citations.
- Depth hurts accuracy: when a model makes many tool calls (up to 150), its Fact Check score drops by ~42% compared with a shallow 2‑call setting.
- One‑shot success rate: fewer than 50% of the open‑source models can produce a fully cited report without additional prompting tricks.
Practical Implications
- Tool‑augmented agents need tighter verification loops – Simply pulling more documents does not guarantee better answers; developers should embed fact‑checking after each retrieval step.
- Automated report generators (e.g., for compliance, market analysis, or academic assistance) should expose a source‑verification UI so end‑users can see whether each citation is reachable, relevant, and factually accurate.
- LLM‑as‑a‑judge pipelines can be integrated into CI/CD for AI‑generated content, automatically flagging low‑accuracy citations before deployment.
- Open‑source model selection – Teams that need verifiable citations should favor larger, well‑tuned open‑source models and invest in post‑generation validation rather than relying on raw generation.
- Regulatory compliance – Industries with strict audit trails (finance, pharma, legal) can use the provided framework to certify that AI‑generated documents meet citation standards, reducing liability.
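The idea of flagging low-accuracy citations before deployment could look roughly like the sketch below. It assumes a judge has already produced per-citation scores; the `JudgedCitation` type, the 0 to 1 score scale, and the threshold values are hypothetical choices for illustration, not part of the paper's released tooling.

```python
from dataclasses import dataclass


@dataclass
class JudgedCitation:
    url: str
    link_works: bool
    relevance: float   # hypothetical judge score in [0, 1]
    fact_check: float  # hypothetical judge score in [0, 1]


def flag_citations(
    citations: list[JudgedCitation],
    min_relevance: float = 0.5,  # assumed threshold
    min_fact: float = 0.7,       # assumed threshold
) -> list[tuple[str, list[str]]]:
    """Return (url, reasons) for every citation that fails at least one check."""
    flagged = []
    for c in citations:
        reasons = []
        if not c.link_works:
            reasons.append("unreachable")
        if c.relevance < min_relevance:
            reasons.append("off-topic")
        if c.fact_check < min_fact:
            reasons.append("unsupported claim")
        if reasons:
            flagged.append((c.url, reasons))
    return flagged


judged = [
    JudgedCitation("https://example.com/a", True, 0.9, 0.95),
    JudgedCitation("https://example.com/b", True, 0.8, 0.40),
    JudgedCitation("https://example.com/c", False, 0.2, 0.10),
]

for url, reasons in flag_citations(judged):
    print(url, "->", ", ".join(reasons))
```

In a CI job, a non-empty result could be turned into a non-zero exit code so that the report is blocked until the flagged citations are reviewed.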
Limitations & Future Work
- Source accessibility bias – Paywalled or dynamically generated pages often fail the Link Works check, potentially penalizing models that cite high‑quality but restricted sources.
- Rubric calibration – The Fact Check dimension relies on LLM judges calibrated on a modest human set; broader human validation could improve reliability.
- Domain coverage – Experiments focus on general‑web sources; specialized domains (e.g., scientific literature behind DOI paywalls) may exhibit different patterns.
- Scalability of retrieval – The current pipeline fetches each URL sequentially; parallelization and caching strategies are needed for large‑scale production use.
Future research directions include extending the parser to handle citation styles beyond Markdown, integrating external fact‑checking APIs, and exploring reinforcement‑learning loops where the agent iteratively refines citations based on verification feedback.
Authors
- Hailey Onweller
- Elias Lumer
- Austin Huber
- Pia Ramchandani
- Vamse Kumar Subbiah
- Corey Feld
Paper Information
- arXiv ID: 2605.06635v1
- Categories: cs.CL
- Published: May 7, 2026