[Paper] Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

Published: May 7, 2026 at 01:46 PM EDT

Source: arXiv - 2605.06635v1

Overview

The paper presents the first systematic way to measure how well large language models (LLMs) actually cite their sources when they act as “deep research agents” that generate long, Markdown‑style reports. By parsing inline citations, retrieving the referenced webpages, and checking them for accessibility, relevance, and factual consistency, the authors expose a hidden reliability gap: even top‑tier models often produce citations that look sound on the surface but do not factually support the claims they are attached to.

Key Contributions

A reproducible AST‑based citation parser that extracts Markdown‑style references from LLM‑generated documents at scale.
A three‑dimensional evaluation framework (Link Works, Relevant Content, Fact Check) that closes the loop by retrieving the cited source and letting humans or LLM judges assess it.
Benchmark results for 14 closed‑source and open‑source LLMs, showing high link‑validity (>94%) and relevance (>80%) but much lower factual accuracy (39‑77%).
Ablation study on tool‑call depth, revealing that more retrieval calls (2 → 150) actually decrease factual correctness by ~42% for frontier models.
Open‑source evaluation infrastructure (parser, rubrics, calibration scripts) that the community can reuse for future citation‑quality research.

Methodology

Report Generation – Each LLM is prompted to write a research‑style report in Markdown, inserting inline citations ([1], [2], etc.) that include URLs.
AST Parsing – A lightweight abstract‑syntax‑tree (AST) parser walks the Markdown document, extracts every citation block, and normalizes the URLs.
Source Retrieval – The pipeline automatically fetches each parsed URL (handling redirects, HTTP errors, and paywalls where possible).
Evaluation Dimensions
- Link Works – Checks if the URL resolves to a reachable page (status 200).
- Relevant Content – Uses semantic similarity (e.g., embeddings) between the cited passage and the surrounding report text to gauge topical alignment.
- Fact Check – Compares factual statements in the report against the retrieved source using a rubric‑driven LLM‑as‑a‑judge, calibrated against a small human‑annotated set.
Scoring & Aggregation – Scores are averaged per model and per dimension, enabling direct comparison across the 14 systems.
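
The AST Parsing step above can be illustrated with a much simpler stand-in. The sketch below is not the authors' parser and does not build an AST: it uses regular expressions from the Python standard library to pull numbered citations with inline URLs out of a Markdown report and to normalize those URLs. The citation shape `[1](https://...)` and the names `Citation`, `normalize_url`, and `extract_citations` are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit

# Assumed citation shape: a numeric label followed by an inline URL,
# e.g. [1](https://example.com/a). Other styles are not handled here.
CITATION_RE = re.compile(r"\[(\d+)\]\((https?://[^\s)]+)\)")
FENCE = "`" * 3  # Markdown code-fence marker


@dataclass(frozen=True)
class Citation:
    label: int  # the number inside the brackets
    url: str    # normalized URL
    line: int   # 1-based line number in the report


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, strip trailing slashes from the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def extract_citations(markdown: str) -> list[Citation]:
    """Return every citation in document order, skipping fenced code blocks."""
    citations = []
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on opening and closing fences
            continue
        if in_fence:
            continue
        for match in CITATION_RE.finditer(line):
            citations.append(
                Citation(int(match.group(1)), normalize_url(match.group(2)), lineno)
            )
    return citations
```

A real AST walk (for example over the token tree of a Markdown library) would handle nested structures, reference-style links, and escaped brackets that this line-by-line pattern match cannot.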

The entire pipeline is open‑source, containerized, and can be run on a modest GPU‑enabled workstation, making it practical for both academia and industry teams.
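
The Scoring & Aggregation step amounts to averaging per-citation outcomes. As a rough sketch, assume each citation is judged on each of the three dimensions with a value between 0 and 1. The record layout and the name `aggregate_scores` are hypothetical, and the paper may weight or filter citations differently.

```python
from collections import defaultdict

DIMENSIONS = ("link_works", "relevant_content", "fact_check")


def aggregate_scores(records):
    """Average per-citation outcomes into one score per (model, dimension).

    `records` is an iterable of dicts with a "model" key and one numeric
    value per dimension. A value of None means the dimension was not judged
    for that citation, so it is left out of that dimension's average.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for rec in records:
        for dim in DIMENSIONS:
            value = rec.get(dim)
            if value is None:
                continue
            totals[rec["model"]][dim] += value
            counts[rec["model"]][dim] += 1
    return {
        model: {dim: totals[model][dim] / counts[model][dim] for dim in counts[model]}
        for model in counts
    }


# Hypothetical records for one model; the third citation has a dead link,
# so relevance and fact checking were never run on it.
records = [
    {"model": "model-a", "link_works": 1, "relevant_content": 1, "fact_check": 0},
    {"model": "model-a", "link_works": 1, "relevant_content": 0, "fact_check": 1},
    {"model": "model-a", "link_works": 0, "relevant_content": None, "fact_check": None},
]
scores = aggregate_scores(records)
```

Whether an unreachable source counts as a failed fact check or is excluded from the average changes the reported numbers, so that choice should be stated alongside any results.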

Results & Findings

Model Category	Link Works	Relevant Content	Fact Check
Frontier closed‑source (e.g., GPT‑4, Claude)	94‑98%	81‑86%	39‑57%
Strong open‑source (e.g., Llama‑2‑70B)	92‑95%	78‑82%	45‑63%
Smaller open‑source (≤13B)	85‑90%	70‑75%	39‑48%

Citation surface quality is high: most models reliably produce reachable URLs and generally cite on‑topic material.
Factual reliability lags: Fact Check ranges in the table top out at 63%, and frontier closed‑source models land between 39% and 57%.
Depth hurts accuracy: when a model makes many tool calls (up to 150), its Fact Check score drops by ~42% compared with a shallow 2‑call setting.
One‑shot success rate: fewer than 50% of the open‑source models can produce a fully cited report without additional prompting tricks.
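
A drop of ~42% can mean a relative decrease or a decrease of 42 percentage points, and the two readings give very different scores. The arithmetic below contrasts them using a made-up shallow-setting score of 0.60, which is not taken from the paper; the function names are likewise illustrative.

```python
def after_relative_drop(baseline: float, drop: float) -> float:
    """Score after losing `drop` as a fraction of the baseline."""
    return baseline * (1.0 - drop)


def after_absolute_drop(baseline: float, drop: float) -> float:
    """Score after losing `drop` percentage points (expressed as a fraction)."""
    return baseline - drop


baseline = 0.60  # hypothetical Fact Check score at 2 tool calls, for illustration only
print(round(after_relative_drop(baseline, 0.42), 3))  # 0.348
print(round(after_absolute_drop(baseline, 0.42), 3))  # 0.18
```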

Practical Implications

Tool‑augmented agents need tighter verification loops – Simply pulling more documents does not guarantee better answers; developers should embed fact‑checking after each retrieval step.
Automated report generators (e.g., for compliance, market analysis, or academic assistance) should expose a source‑verification UI so end‑users can see whether a citation is reachable, relevant, and factually accurate.
LLM‑as‑a‑judge pipelines can be integrated into CI/CD for AI‑generated content, automatically flagging low‑accuracy citations before deployment.
Open‑source model selection – Teams that need verifiable citations should favor larger, well‑tuned open‑source models and invest in post‑generation validation rather than relying on raw generation.
Regulatory compliance – Industries with strict audit trails (finance, pharma, legal) can use the provided framework to certify that AI‑generated documents meet citation standards, reducing liability.
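
One way to picture the CI/CD idea above is a gate that fails a build when too many citations are flagged. The sketch assumes an upstream judge has already produced a verdict per citation for each of the three dimensions. The `JudgedCitation` record, the `flagged` and `gate` functions, and the 0.8 threshold are all hypothetical choices, not part of the paper's tooling.

```python
from dataclasses import dataclass


@dataclass
class JudgedCitation:
    url: str
    reachable: bool       # Link Works verdict
    relevant: bool        # Relevant Content verdict
    fact_supported: bool  # Fact Check verdict


def flagged(citations):
    """Citations that fail at least one of the three checks."""
    return [c for c in citations if not (c.reachable and c.relevant and c.fact_supported)]


def gate(citations, min_pass_rate=0.8):
    """Return (ok, pass_rate, failures); a report with no citations fails."""
    if not citations:
        return False, 0.0, []
    failures = flagged(citations)
    pass_rate = 1.0 - len(failures) / len(citations)
    return pass_rate >= min_pass_rate, pass_rate, failures
```

In a pipeline, a small wrapper script could call `gate` on the judged citations of each generated report and exit with a non-zero status when `ok` is false, listing the failing URLs for review.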

Limitations & Future Work

Source accessibility bias – Paywalled or dynamically generated pages often fail the Link Works check, potentially penalizing models that cite high‑quality but restricted sources.
Rubric calibration – The Fact Check dimension relies on LLM judges calibrated on a modest human set; broader human validation could improve reliability.
Domain coverage – Experiments focus on general‑web sources; specialized domains (e.g., scientific literature behind DOI paywalls) may exhibit different patterns.
Scalability of retrieval – The current pipeline fetches each URL sequentially; parallelization and caching strategies are needed for large‑scale production use.

Future research directions include extending the parser to handle citation styles beyond Markdown, integrating external fact‑checking APIs, and exploring reinforcement‑learning loops where the agent iteratively refines citations based on verification feedback.

Authors

Hailey Onweller
Elias Lumer
Austin Huber
Pia Ramchandani
Vamse Kumar Subbiah
Corey Feld

Paper Information

arXiv ID: 2605.06635v1
Categories: cs.CL
Published: May 7, 2026