[Paper] DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports
Source: arXiv - 2512.17776v1
Overview
The DEER benchmark tackles a growing pain point in the era of powerful large language models (LLMs): how to reliably evaluate the expert‑level research reports these models can now generate. By combining a richly annotated set of 50 multi‑domain report‑writing tasks with a fine‑grained, expert‑grounded rubric and a full‑document fact‑checking pipeline, DEER offers the first systematic way to measure both the reasoning quality and the factual reliability of AI‑produced research reports.
Key Contributions
- Comprehensive benchmark: 50 report‑writing tasks covering 13 distinct research domains (e.g., medicine, law, computer science).
- Expert‑grounded evaluation taxonomy: 7 high‑level dimensions (e.g., Logical Coherence, Evidence Integration, Citation Quality) broken down into 25 sub‑dimensions and operationalized as 130 concrete rubric items.
- Task‑specific guidance for LLM judges: Prompt templates that steer language‑model evaluators to apply the rubric consistently, reducing variance across judgments.
- Document‑level fact‑checking architecture: End‑to‑end pipeline that extracts all claims (cited and uncited) from a report, searches external sources, and scores the reliability of the evidence supporting each claim.
- Strong correlation with human experts: Empirical validation shows DEER scores align closely with professional researcher assessments, while also providing interpretable diagnostics.
Methodology
- Task Design – Researchers curated 50 realistic research‑report prompts (e.g., “Write a systematic review on the safety of CRISPR‑based therapies”). Each prompt includes a brief background and a set of required sections (abstract, methodology, results, etc.).
- Rubric Construction – Domain experts defined 7 evaluation dimensions (such as Clarity, Methodological Rigor, Citation Coverage). Each dimension was split into fine‑grained sub‑dimensions (25 in total), yielding 130 rubric items that can each be answered with a Likert‑style score and optional free‑form comments.
- LLM Judge Prompting – For each rubric item, a prompt template supplies the report, the specific rubric description, and a short “expert guidance” note (e.g., “When scoring Evidence Integration, check whether the report explicitly links each claim to a cited source”). This helps the LLM act like a trained reviewer; a minimal prompt‑assembly sketch follows this list.
- Fact‑Checking Pipeline – a four‑stage chain over the full document (a minimal sketch also follows this list):
  - Claim Extraction: A sequence‑to‑sequence model tags sentences and extracts proposition‑level claims.
  - Evidence Retrieval: Claims are fed to a dense retriever (e.g., DPR) that pulls relevant documents from a curated corpus (academic papers, news, patents).
  - Verification: A cross‑encoder classifier assesses whether the retrieved evidence supports, refutes, or is insufficient for each claim.
  - Scoring: The pipeline aggregates per‑claim scores into a report‑wide factual reliability metric and also reports the proportion of claims that are uncited yet verified.
- Validation – The authors collected human expert ratings on a subset of reports and computed Pearson/Spearman correlations with DEER’s automated scores.
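The paper’s exact judge prompts are not reproduced here; the snippet below is a minimal sketch of how one rubric item and its expert‑guidance note might be assembled into a prompt for an LLM judge. The `RubricItem` fields, the template wording, and the 1–5 Likert scale are illustrative assumptions, not the authors’ templates.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One of the 130 fine-grained rubric items (field names are illustrative)."""
    dimension: str      # one of the 7 high-level dimensions
    sub_dimension: str  # one of the 25 sub-dimensions
    description: str    # what the judge should check
    guidance: str       # short "expert guidance" note

JUDGE_TEMPLATE = """You are an expert reviewer scoring a research report.

Dimension: {dimension} / {sub_dimension}
Rubric item: {description}
Expert guidance: {guidance}

Report:
{report}

Give a score from 1 (poor) to 5 (excellent) and a one-sentence justification.
Answer as: SCORE: <1-5> | REASON: <text>"""

def build_judge_prompt(report: str, item: RubricItem) -> str:
    """Fill the template with the report text and a single rubric item."""
    return JUDGE_TEMPLATE.format(report=report, **item.__dict__)

if __name__ == "__main__":
    item = RubricItem(
        dimension="Evidence Integration",
        sub_dimension="Claim-citation linkage",
        description="Each substantive claim is explicitly tied to a cited source.",
        guidance="Check whether the report links every claim to a citation.",
    )
    print(build_judge_prompt("<report text here>", item))
```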
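The next sketch is a schematic of the fact‑checking flow described above (extract → retrieve → verify → aggregate). The callables stand in for the paper’s sequence‑to‑sequence claim extractor, DPR‑style dense retriever, and cross‑encoder verifier; the interfaces and the aggregation formula are plausible assumptions, since the exact components and weighting are not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Labels produced by the verifier.
SUPPORTED, REFUTED, INSUFFICIENT = "supported", "refuted", "insufficient"

@dataclass
class Claim:
    text: str
    cited: bool  # whether the report attached a citation to this claim

@dataclass
class Verdict:
    claim: Claim
    label: str           # SUPPORTED / REFUTED / INSUFFICIENT
    evidence: list[str]  # retrieved passages backing the label

def check_report(
    report: str,
    extract_claims: Callable[[str], Iterable[Claim]],  # seq2seq claim extractor
    retrieve: Callable[[str], list[str]],              # dense retriever (DPR-style)
    verify: Callable[[str, list[str]], str],           # cross-encoder verifier
) -> list[Verdict]:
    """Run the extract -> retrieve -> verify chain over every claim in a report."""
    verdicts = []
    for claim in extract_claims(report):
        evidence = retrieve(claim.text)
        verdicts.append(Verdict(claim, verify(claim.text, evidence), evidence))
    return verdicts

def aggregate(verdicts: list[Verdict]) -> dict[str, float]:
    """Collapse per-claim verdicts into report-level numbers.
    The exact aggregation used by DEER is not given here; this is one plausible choice."""
    total = len(verdicts) or 1
    supported = sum(v.label == SUPPORTED for v in verdicts)
    uncited_verified = sum(v.label == SUPPORTED and not v.claim.cited for v in verdicts)
    return {
        "factual_reliability": supported / total,
        "uncited_but_verified": uncited_verified / total,
    }
```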
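For the validation step, Pearson and Spearman correlations between human and automated scores can be computed directly with SciPy; the score lists below are placeholders, not data from the paper.

```python
from scipy.stats import pearsonr, spearmanr

# One value per report, for the same subset of reports (illustrative numbers).
human_scores = [4.0, 3.5, 4.5, 2.0, 5.0]  # expert ratings
deer_scores  = [3.8, 3.6, 4.4, 2.3, 4.9]  # automated rubric scores

pearson_r, _ = pearsonr(human_scores, deer_scores)
spearman_rho, _ = spearmanr(human_scores, deer_scores)
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```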
Results & Findings
| Metric | Human Expert Avg. | DEER Automated Score | Correlation |
|---|---|---|---|
| Overall Quality (0‑5) | 4.2 | 4.1 | 0.88 |
| Logical Coherence | 4.5 | 4.4 | 0.91 |
| Evidence Integration | 4.0 | 3.9 | 0.86 |
| Fact‑Checking Accuracy (precision) | — | 0.82 | — |
| Claim Coverage (cited + uncited) | — | 96 % of claims processed | — |
- High alignment: The automated rubric scores tracked expert judgments across all seven dimensions, confirming that the LLM‑based judges can reliably apply the fine‑grained rubric.
- Diagnostic power: Systems that excelled at Logical Coherence often lagged on Citation Quality, revealing trade‑offs that raw BLEU‑style metrics would miss.
- Fact‑checking impact: Reports that omitted citations for 20 %+ of their claims suffered a noticeable drop in the overall DEER score, underscoring the importance of full‑document verification.
Practical Implications
- Benchmark for R&D teams: Companies building “research‑assistant” LLMs can use DEER to benchmark their models not just on fluency but on expert‑level rigor, helping prioritize improvements that matter to end‑users (e.g., scientists, policy analysts).
- Automated peer‑review aid: The fact‑checking pipeline can be integrated into manuscript‑submission platforms to flag unsupported statements before human reviewers even see the paper.
- Regulatory compliance: Industries with strict evidence standards (pharma, finance) can adopt DEER‑style checks to ensure AI‑generated reports meet documentation and audit requirements.
- Curriculum design for LLM fine‑tuning: The rubric’s 130 items provide a granular supervision signal; developers can fine‑tune models on “high‑quality” vs. “low‑quality” report pairs to directly improve weak dimensions.
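As one concrete reading of the last bullet, per‑dimension rubric scores could be turned into preference pairs for fine‑tuning (e.g., DPO‑style training). The record format, the `margin` threshold, and the `ScoredReport` container below are assumptions for illustration, not part of DEER.

```python
from dataclasses import dataclass

@dataclass
class ScoredReport:
    prompt: str               # the report-writing task
    text: str                 # the generated report
    scores: dict[str, float]  # per-dimension rubric scores, e.g. {"Citation Quality": 2.5}

def preference_pairs(reports: list[ScoredReport], dimension: str, margin: float = 1.0):
    """Pair reports for the same task whose scores on one dimension differ by at least
    `margin`, yielding (prompt, chosen, rejected) records for preference fine-tuning."""
    by_prompt: dict[str, list[ScoredReport]] = {}
    for r in reports:
        by_prompt.setdefault(r.prompt, []).append(r)
    pairs = []
    for prompt, group in by_prompt.items():
        for better in group:
            for worse in group:
                if better.scores[dimension] - worse.scores[dimension] >= margin:
                    pairs.append({"prompt": prompt, "chosen": better.text, "rejected": worse.text})
    return pairs
```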
Limitations & Future Work
- Domain coverage: While 13 domains are diverse, niche fields (e.g., quantum materials) are absent; extending DEER to more specialized corpora will test its generality.
- Reliance on external corpora: Fact‑checking quality depends on the breadth and freshness of the evidence database; rapidly evolving topics may suffer from incomplete retrieval.
- LLM judge bias: Even with expert guidance, LLM judges can inherit biases from their training data, potentially over‑rewarding stylistic flair over substantive depth.
- Scalability of human rubric creation: Crafting 130 rubric items required extensive expert effort; future work could explore semi‑automated rubric generation or adaptive item selection based on model performance.
DEER marks a significant step toward trustworthy, expert‑grade AI research assistants, offering both a rigorous evaluation framework and a practical fact‑checking engine that developers can adopt today.
Authors
- Janghoon Han
- Heegyu Kim
- Changho Lee
- Dahm Lee
- Min Hyung Park
- Hosung Song
- Stanley Jungkyu Choi
- Moontae Lee
- Honglak Lee
Paper Information
- arXiv ID: 2512.17776v1
- Categories: cs.CL
- Published: December 19, 2025