[Paper] DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports

Published: December 19, 2025 at 11:46 AM EST
4 min read
Source: arXiv - 2512.17776v1

Overview

The DEER benchmark tackles a growing pain point in the era of powerful large language models (LLMs): how do we reliably evaluate expert‑level research reports that these models can now generate? By combining a richly annotated set of 50 multi‑domain report‑writing tasks with a fine‑grained, expert‑grounded rubric and a full‑document fact‑checking pipeline, DEER offers the first systematic way to measure both the quality of reasoning and the factual reliability of AI‑produced research summaries.

Key Contributions

  • Comprehensive benchmark: 50 report‑writing tasks covering 13 distinct research domains (e.g., medicine, law, computer science).
  • Expert‑grounded evaluation taxonomy: 7 high‑level dimensions (e.g., Logical Coherence, Evidence Integration, Citation Quality) broken down into 25 sub‑dimensions and operationalized as 130 concrete rubric items (a structural sketch follows this list).
  • Task‑specific guidance for LLM judges: Prompt templates that steer language‑model evaluators to apply the rubric consistently, reducing variance across judgments.
  • Document‑level fact‑checking architecture: End‑to‑end pipeline that extracts all claims (cited and uncited) from a report, searches external sources, and scores the reliability of the evidence supporting each claim.
  • Strong correlation with human experts: Empirical validation shows DEER scores align closely with professional researcher assessments, while also providing interpretable diagnostics.
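
To make the taxonomy concrete, the sketch below shows one way the dimension → sub‑dimension → item hierarchy could be represented. The class layout and the example names are illustrative assumptions, not the paper's actual rubric wording.

```python
from dataclasses import dataclass, field


@dataclass
class RubricItem:
    """One of the 130 concrete checks, scored on a Likert scale with optional comments."""
    item_id: str
    description: str


@dataclass
class SubDimension:
    """One of the 25 finer-grained groupings under a dimension."""
    name: str
    items: list[RubricItem] = field(default_factory=list)


@dataclass
class Dimension:
    """One of the 7 high-level dimensions (e.g., Evidence Integration)."""
    name: str
    sub_dimensions: list[SubDimension] = field(default_factory=list)


# Illustrative fragment only; the names below are hypothetical, not the paper's exact wording.
evidence_integration = Dimension(
    name="Evidence Integration",
    sub_dimensions=[
        SubDimension(
            name="Claim-source linkage",
            items=[
                RubricItem("EI-01", "Each major claim is explicitly tied to a cited source."),
                RubricItem("EI-02", "Conflicting evidence is acknowledged and reconciled."),
            ],
        )
    ],
)
```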

Methodology

  1. Task Design – Researchers curated 50 realistic research‑report prompts (e.g., “Write a systematic review on the safety of CRISPR‑based therapies”). Each prompt includes a brief background and a set of required sections (abstract, methodology, results, etc.).
  2. Rubric Construction – Domain experts defined 7 evaluation dimensions (such as Clarity, Methodological Rigor, Citation Coverage). Each dimension was split into fine‑grained sub‑dimensions, yielding 130 rubric items that can be answered with a Likert‑style score and optional free‑form comments.
  3. LLM Judge Prompting – For each rubric item, a prompt template supplies the report, the specific rubric description, and a short “expert guidance” note (e.g., “When scoring Evidence Integration, check whether the report explicitly links each claim to a cited source”). This helps the LLM act like a trained reviewer. A prompt‑construction sketch appears after this list.
  4. Fact‑Checking Pipeline (a code sketch of these stages also appears after this list)
    • Claim Extraction: A sequence‑to‑sequence model tags sentences and extracts proposition‑level claims.
    • Evidence Retrieval: Claims are fed to a dense retriever (e.g., DPR) that pulls relevant documents from a curated corpus (academic papers, news, patents).
    • Verification: A cross‑encoder classifier assesses whether the retrieved evidence supports, refutes, or is insufficient for each claim.
    • Scoring: The pipeline aggregates per‑claim scores into a report‑wide factual reliability metric, also reporting the proportion of claims that are uncited yet verified.
  5. Validation – The authors collected human expert ratings on a subset of reports and computed Pearson/Spearman correlations with DEER’s automated scores.
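
The judge‑prompting step (item 3 above) can be reproduced in spirit with a simple template‑filling loop. The sketch below is a minimal illustration, not DEER's actual prompts: the template text, the 1–5 scale, and the `call_llm` function are assumptions standing in for whatever prompt wording and chat‑completion API the authors use.

```python
# Minimal sketch of rubric-guided LLM judging (illustrative; not DEER's actual prompts).

JUDGE_TEMPLATE = """You are an expert reviewer scoring a research report.

Rubric dimension: {dimension}
Rubric item: {item_description}
Expert guidance: {guidance}

Report:
{report}

Score the report on this item from 1 (very poor) to 5 (excellent).
Answer with a single integer followed by a one-sentence justification."""


def build_judge_prompt(report: str, dimension: str, item_description: str, guidance: str) -> str:
    """Fill the template for one rubric item."""
    return JUDGE_TEMPLATE.format(
        dimension=dimension,
        item_description=item_description,
        guidance=guidance,
        report=report,
    )


def score_report(report: str, rubric_items: list[dict], call_llm) -> dict[str, int]:
    """Query an LLM judge once per rubric item; `call_llm` is any text-in/text-out function."""
    scores = {}
    for item in rubric_items:
        prompt = build_judge_prompt(report, item["dimension"], item["description"], item["guidance"])
        reply = call_llm(prompt)
        scores[item["id"]] = int(reply.strip().split()[0])  # first token is the Likert score
    return scores
```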
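
The fact‑checking pipeline (item 4 above) decomposes into the four stages described. The sketch below wires those stages together around generic callables; the component models, the verdict label names, and the aggregation into a single reliability score are plausible assumptions rather than the paper's exact implementation.

```python
from typing import Callable

# Verdict labels produced by the verification stage.
SUPPORTED, REFUTED, INSUFFICIENT = "supported", "refuted", "insufficient"


def fact_check_report(
    report: str,
    extract_claims: Callable[[str], list[str]],  # claim extraction (cited and uncited claims)
    retrieve: Callable[[str], list[str]],        # dense retrieval over the evidence corpus
    verify: Callable[[str, list[str]], str],     # cross-encoder verdict for (claim, evidence)
) -> dict:
    """Run extract -> retrieve -> verify over every claim, then aggregate per-claim verdicts."""
    claims = extract_claims(report)
    verdicts = []
    for claim in claims:
        evidence = retrieve(claim)
        verdicts.append(verify(claim, evidence))

    n = len(verdicts)
    supported = sum(v == SUPPORTED for v in verdicts)
    # One plausible report-wide reliability metric: the fraction of claims backed by evidence.
    return {
        "num_claims": n,
        "supported_rate": supported / n if n else 0.0,
        "num_refuted": sum(v == REFUTED for v in verdicts),
        "num_insufficient": sum(v == INSUFFICIENT for v in verdicts),
    }
```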

Results & Findings

| Metric | Human Expert Avg. | DEER Automated Score | Correlation |
| --- | --- | --- | --- |
| Overall Quality (0‑5) | 4.2 | 4.1 | 0.88 |
| Logical Coherence | 4.5 | 4.4 | 0.91 |
| Evidence Integration | 4.0 | 3.9 | 0.86 |
| Fact‑Checking Accuracy (precision) | | 0.82 | |
| Claim Coverage (cited + uncited) | | 96 % of claims processed | |

  • High alignment: The automated rubric scores tracked expert judgments across all seven dimensions, confirming that the LLM‑based judges can reliably apply the fine‑grained rubric.
  • Diagnostic power: Systems that excelled at Logical Coherence often lagged on Citation Quality, revealing trade‑offs that raw BLEU‑style metrics would miss.
  • Fact‑checking impact: Reports that left 20 % or more of their claims uncited saw a noticeable drop in their overall DEER score, underscoring the importance of full‑document verification.
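
The correlation figures in the table are standard Pearson and Spearman coefficients computed over paired human and automated scores. A minimal reproduction sketch with SciPy, using made‑up example values rather than the paper's data:

```python
from scipy.stats import pearsonr, spearmanr

# Paired per-report scores (illustrative values only, not the paper's data).
human_scores = [4.5, 3.8, 4.1, 4.7, 3.9]
deer_scores = [4.4, 3.9, 4.0, 4.6, 3.7]

pearson_r, _ = pearsonr(human_scores, deer_scores)
spearman_rho, _ = spearmanr(human_scores, deer_scores)
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```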

Practical Implications

  • Benchmark for R&D teams: Companies building “research‑assistant” LLMs can use DEER to benchmark their models not just on fluency but on expert‑level rigor, helping prioritize improvements that matter to end‑users (e.g., scientists, policy analysts).
  • Automated peer‑review aid: The fact‑checking pipeline can be integrated into manuscript‑submission platforms to flag unsupported statements before human reviewers even see the paper.
  • Regulatory compliance: Industries with strict evidence standards (pharma, finance) can adopt DEER‑style checks to ensure AI‑generated reports meet documentation and audit requirements.
  • Curriculum design for LLM fine‑tuning: The rubric’s 130 items provide a granular supervision signal; developers can fine‑tune models on “high‑quality” vs. “low‑quality” report pairs to directly improve weak dimensions.

Limitations & Future Work

  • Domain coverage: While 13 domains are diverse, niche fields (e.g., quantum materials) are absent; extending DEER to more specialized corpora will test its generality.
  • Reliance on external corpora: Fact‑checking quality depends on the breadth and freshness of the evidence database; rapidly evolving topics may suffer from incomplete retrieval.
  • LLM judge bias: Even with expert guidance, LLM judges can inherit biases from their training data, potentially over‑rewarding stylistic flair over substantive depth.
  • Scalability of human rubric creation: Crafting 130 rubric items required extensive expert effort; future work could explore semi‑automated rubric generation or adaptive item selection based on model performance.

DEER marks a significant step toward trustworthy, expert‑grade AI research assistants, offering both a rigorous evaluation framework and a practical fact‑checking engine that developers can adopt today.

Authors

  • Janghoon Han
  • Heegyu Kim
  • Changho Lee
  • Dahm Lee
  • Min Hyung Park
  • Hosung Song
  • Stanley Jungkyu Choi
  • Moontae Lee
  • Honglak Lee

Paper Information

  • arXiv ID: 2512.17776v1
  • Categories: cs.CL
  • Published: December 19, 2025