[Paper] DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports
Source: arXiv - 2512.17776v1
Overview
The DEER benchmark tackles a growing pain point in the era of powerful large language models (LLMs): how to reliably evaluate the expert‑level research reports these models can now generate. By combining a richly annotated set of 50 multi‑domain report‑writing tasks with a fine‑grained, expert‑grounded rubric and a full‑document fact‑checking pipeline, DEER offers the first systematic way to measure both the reasoning quality and the factual reliability of AI‑produced research reports.
Key Contributions
- Comprehensive benchmark: 50 report‑writing tasks covering 13 distinct research domains (e.g., medicine, law, computer science).
- Expert‑grounded evaluation taxonomy: 7 high‑level dimensions (e.g., Logical Coherence, Evidence Integration, Citation Quality) broken down into 25 sub‑dimensions and operationalized as 130 concrete rubric items.
- Task‑specific guidance for LLM judges: Prompt templates that steer language‑model evaluators to apply the rubric consistently, reducing variance across judgments.
- Document‑level fact‑checking architecture: End‑to‑end pipeline that extracts all claims (cited and uncited) from a report, searches external sources, and scores the reliability of the evidence supporting each claim.
- Strong correlation with human experts: Empirical validation shows DEER scores align closely with professional researcher assessments, while also providing interpretable diagnostics.
Methodology
- Task Design – Researchers curated 50 realistic research‑report prompts (e.g., “Write a systematic review on the safety of CRISPR‑based therapies”). Each prompt includes a brief background and a set of required sections (abstract, methodology, results, etc.).
- Rubric Construction – Domain experts defined 7 evaluation dimensions (such as Clarity, Methodological Rigor, Citation Coverage). Each dimension was split into fine‑grained sub‑dimensions (25 in total), yielding 130 rubric items that can each be answered with a Likert‑style score and optional free‑form comments.
- LLM Judge Prompting – For each rubric item, a prompt template supplies the report, the specific rubric description, and a short “expert guidance” note (e.g., “When scoring Evidence Integration, check whether the report explicitly links each claim to a cited source”). This helps the LLM act like a trained reviewer; a minimal prompt‑assembly sketch follows this list.
- Fact‑Checking Pipeline – a four‑stage chain over the full document (a minimal sketch also follows this list):
  - Claim Extraction: A sequence‑to‑sequence model tags sentences and extracts proposition‑level claims.
  - Evidence Retrieval: Claims are fed to a dense retriever (e.g., DPR) that pulls relevant documents from a curated corpus (academic papers, news, patents).
  - Verification: A cross‑encoder classifier assesses whether the retrieved evidence supports, refutes, or is insufficient for each claim.
  - Scoring: The pipeline aggregates per‑claim scores into a report‑wide factual reliability metric and also reports the proportion of claims that are uncited yet verified.
- Validation – The authors collected human expert ratings on a subset of reports and computed Pearson/Spearman correlations with DEER’s automated scores.
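The paper’s exact judge prompts are not reproduced here; the snippet below is a minimal sketch of how one rubric item and its expert‑guidance note might be assembled into a prompt for an LLM judge. The `RubricItem` fields, the template wording, and the 1–5 Likert scale are illustrative assumptions, not the authors’ templates.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One of the 130 fine-grained rubric items (field names are illustrative)."""
    dimension: str      # one of the 7 high-level dimensions
    sub_dimension: str  # one of the 25 sub-dimensions
    description: str    # what the judge should check
    guidance: str       # short "expert guidance" note

JUDGE_TEMPLATE = """You are an expert reviewer scoring a research report.

Dimension: {dimension} / {sub_dimension}
Rubric item: {description}
Expert guidance: {guidance}

Report:
{report}

Give a score from 1 (poor) to 5 (excellent) and a one-sentence justification.
Answer as: SCORE: <1-5> | REASON: <text>"""

def build_judge_prompt(report: str, item: RubricItem) -> str:
    """Fill the template with the report text and a single rubric item."""
    return JUDGE_TEMPLATE.format(report=report, **item.__dict__)

if __name__ == "__main__":
    item = RubricItem(
        dimension="Evidence Integration",
        sub_dimension="Claim-citation linkage",
        description="Each substantive claim is explicitly tied to a cited source.",
        guidance="Check whether the report links every claim to a citation.",
    )
    print(build_judge_prompt("<report text here>", item))
```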
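The next sketch is a schematic of the fact‑checking flow described above (extract → retrieve → verify → aggregate). The callables stand in for the paper’s sequence‑to‑sequence claim extractor, DPR‑style dense retriever, and cross‑encoder verifier; the interfaces and the aggregation formula are plausible assumptions, since the exact components and weighting are not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Labels produced by the verifier.
SUPPORTED, REFUTED, INSUFFICIENT = "supported", "refuted", "insufficient"

@dataclass
class Claim:
    text: str
    cited: bool  # whether the report attached a citation to this claim

@dataclass
class Verdict:
    claim: Claim
    label: str           # SUPPORTED / REFUTED / INSUFFICIENT
    evidence: list[str]  # retrieved passages backing the label

def check_report(
    report: str,
    extract_claims: Callable[[str], Iterable[Claim]],  # seq2seq claim extractor
    retrieve: Callable[[str], list[str]],              # dense retriever (DPR-style)
    verify: Callable[[str, list[str]], str],           # cross-encoder verifier
) -> list[Verdict]:
    """Run the extract -> retrieve -> verify chain over every claim in a report."""
    verdicts = []
    for claim in extract_claims(report):
        evidence = retrieve(claim.text)
        verdicts.append(Verdict(claim, verify(claim.text, evidence), evidence))
    return verdicts

def aggregate(verdicts: list[Verdict]) -> dict[str, float]:
    """Collapse per-claim verdicts into report-level numbers.
    The exact aggregation used by DEER is not given here; this is one plausible choice."""
    total = len(verdicts) or 1
    supported = sum(v.label == SUPPORTED for v in verdicts)
    uncited_verified = sum(v.label == SUPPORTED and not v.claim.cited for v in verdicts)
    return {
        "factual_reliability": supported / total,
        "uncited_but_verified": uncited_verified / total,
    }
```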
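For the validation step, Pearson and Spearman correlations between human and automated scores can be computed directly with SciPy; the score lists below are placeholders, not data from the paper.

```python
from scipy.stats import pearsonr, spearmanr

# One value per report, for the same subset of reports (illustrative numbers).
human_scores = [4.0, 3.5, 4.5, 2.0, 5.0]  # expert ratings
deer_scores  = [3.8, 3.6, 4.4, 2.3, 4.9]  # automated rubric scores

pearson_r, _ = pearsonr(human_scores, deer_scores)
spearman_rho, _ = spearmanr(human_scores, deer_scores)
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```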
Results & Findings
| Metric | Human Expert Avg. | DEER Automated Score | Correlation |
|---|---|---|---|
| Overall Quality (0‑5) | 4.2 | 4.1 | 0.88 |
| Logical Coherence | 4.5 | 4.4 | 0.91 |
| Evidence Integration | 4.0 | 3.9 | 0.86 |
| Fact‑Checking Accuracy (precision) | — | 0.82 | — |
| Claim Coverage (cited + uncited) | — | 96 % of claims processed | — |
- High alignment: The automated rubric scores tracked expert judgments across all seven dimensions, confirming that the LLM‑based judges can reliably apply the fine‑grained rubric.
- Diagnostic power: Systems that excelled at Logical Coherence often lagged on Citation Quality, revealing trade‑offs that raw BLEU‑style metrics would miss.
- Fact‑checking impact: Reports that omitted citations for 20 %+ of their claims suffered a noticeable drop in the overall DEER score, underscoring the importance of full‑document verification.
Practical Implications
- Benchmark for R&D teams: Companies building “research‑assistant” LLMs can use DEER to benchmark their models not just on fluency but on expert‑level rigor, helping prioritize improvements that matter to end‑users (e.g., scientists, policy analysts).
- Automated peer‑review aid: The fact‑checking pipeline can be integrated into manuscript‑submission platforms to flag unsupported statements before human reviewers even see the paper.
- Regulatory compliance: Industries with strict evidence standards (pharma, finance) can adopt DEER‑style checks to ensure AI‑generated reports meet documentation and audit requirements.
- Curriculum design for LLM fine‑tuning: The rubric’s 130 items provide a granular supervision signal; developers can fine‑tune models on “high‑quality” vs. “low‑quality” report pairs to directly improve weak dimensions.
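As one concrete reading of the last bullet, per‑dimension rubric scores could be turned into preference pairs for fine‑tuning (e.g., DPO‑style training). The record format, the `margin` threshold, and the `ScoredReport` container below are assumptions for illustration, not part of DEER.

```python
from dataclasses import dataclass

@dataclass
class ScoredReport:
    prompt: str               # the report-writing task
    text: str                 # the generated report
    scores: dict[str, float]  # per-dimension rubric scores, e.g. {"Citation Quality": 2.5}

def preference_pairs(reports: list[ScoredReport], dimension: str, margin: float = 1.0):
    """Pair reports for the same task whose scores on one dimension differ by at least
    `margin`, yielding (prompt, chosen, rejected) records for preference fine-tuning."""
    by_prompt: dict[str, list[ScoredReport]] = {}
    for r in reports:
        by_prompt.setdefault(r.prompt, []).append(r)
    pairs = []
    for prompt, group in by_prompt.items():
        for better in group:
            for worse in group:
                if better.scores[dimension] - worse.scores[dimension] >= margin:
                    pairs.append({"prompt": prompt, "chosen": better.text, "rejected": worse.text})
    return pairs
```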
Limitations & Future Work
- Domain coverage: While 13 domains are diverse, niche fields (e.g., quantum materials) are absent; extending DEER to more specialized corpora will test its generality.
- Reliance on external corpora: Fact‑checking quality depends on the breadth and freshness of the evidence database; rapidly evolving topics may suffer from incomplete retrieval.
- LLM judge bias: Even with expert guidance, LLM judges can inherit biases from their training data, potentially over‑rewarding stylistic flair over substantive depth.
- Scalability of human rubric creation: Crafting 130 rubric items required extensive expert effort; future work could explore semi‑automated rubric generation or adaptive item selection based on model performance.
DEER marks a significant step toward trustworthy, expert‑grade AI research assistants, offering both a rigorous evaluation framework and a practical fact‑checking engine that developers can adopt today.
Authors
- Janghoon Han
- Heegyu Kim
- Changho Lee
- Dahm Lee
- Min Hyung Park
- Hosung Song
- Stanley Jungkyu Choi
- Moontae Lee
- Honglak Lee
Paper Information
- arXiv ID: 2512.17776v1
- Categories: cs.CL
- Published: December 19, 2025