[Paper] An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Source: arXiv - 2602.21059v1
Overview
Large Language Models (LLMs) are increasingly being used for scholarly question‑answering—think of a researcher asking a model to summarize a paper or retrieve a specific method. While these systems are powerful, their answers can be subtly wrong, and existing automated metrics often miss the kinds of mistakes that real scientists care about. This paper introduces a rigorously built “expert schema” that captures how domain scientists actually spot and classify LLM errors, offering a more nuanced way to evaluate scholarly QA tools.
Key Contributions
- A 20‑item error taxonomy organized into seven high‑level categories (e.g., factual inaccuracy, methodological misinterpretation, and value bias).
- Empirical grounding: taxonomy derived from thematic analysis of 68 real question‑answer pairs across multiple scientific domains.
- Validation with practitioners: contextual inquiries with 10 additional scientists showed that the schema covers the errors they naturally notice and also surfaces problems they would otherwise have missed.
- Insight into expert assessment strategies: identified three systematic approaches—technical precision testing, value‑based evaluation, and meta‑evaluation of the evaluation process itself.
- Design implications: outlines how schema‑driven tools could personalize error‑checking workflows for users with different expertise levels.
Methodology
- Data collection – Researchers gathered 68 question‑answer pairs from existing scholarly QA systems (e.g., GPT‑4‑based search assistants).
- Thematic analysis – A team of domain experts (biologists, chemists, computer scientists) coded the LLM outputs, iteratively clustering recurring problems into patterns.
- Schema construction – The 20 error patterns were grouped into seven broader categories, each with clear definitions and examples.
- Validation via contextual inquiry – Ten scientists from unrelated labs used the schema while reviewing new QA outputs. Researchers observed how participants applied the categories, noted any missing patterns, and recorded how the schema changed their error‑detection behavior.
- Iterative refinement – Feedback from the inquiries was used to tweak definitions and add missing nuances, resulting in the final expert schema.
Results & Findings
- High coverage: The 20‑pattern schema captured 94% of the errors identified by participants, versus under 60% for standard automated metrics (BLEU, ROUGE, factuality scores).
- Error‑detection boost: When using the schema, scientists uncovered on average 2.3 additional errors per answer that they had initially missed.
- Assessment strategies: Participants consistently followed a three‑step routine—first checking raw factual precision, then evaluating whether the answer aligns with disciplinary values (e.g., reproducibility standards), and finally reflecting on their own evaluation criteria.
- User perception: Scientists reported that the schema made the evaluation process feel “more systematic” and “closer to how we peer‑review papers.”
Practical Implications
- Better QA tooling: Embedding the schema into scholarly assistants lets developers present a checklist UI that prompts users to verify specific error dimensions rather than rely on a single "confidence score" (a sketch of such a checklist structure follows this list).
- Personalized evaluation assistants: Because the schema maps to distinct expert strategies, tools can adapt the checklist based on a user’s role (e.g., junior researcher vs. senior reviewer) and domain, reducing cognitive load.
- Improved benchmarking: Researchers building new LLM‑based scholarly services can adopt the taxonomy for human‑in‑the‑loop evaluation, yielding more realistic performance numbers that matter to end‑users.
- Compliance & audit trails: In regulated fields (e.g., biomedical research), the schema provides a documented, standards‑aligned way to certify that AI‑generated answers meet domain‑specific expectations of rigor.
- Training data curation: Error categories can guide the creation of targeted fine‑tuning datasets—e.g., feeding the model more examples that address “methodological misinterpretation” errors.
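To make the checklist and personalization ideas above concrete, here is a minimal sketch (in Python) of how the seven‑category, 20‑pattern schema could be encoded as a data structure that drives a role‑aware checklist. The category names, pattern IDs, `audience` field, and `build_checklist` helper are illustrative assumptions; the paper does not enumerate the full taxonomy or prescribe an implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorPattern:
    """One error pattern in the schema; labels below are illustrative, not the paper's."""
    pattern_id: str
    category: str                      # one of the seven high-level categories
    definition: str
    audience: set = field(default_factory=lambda: {"junior", "senior"})


# Hypothetical excerpt of the 20-pattern taxonomy.
SCHEMA = [
    ErrorPattern("fact-01", "factual inaccuracy",
                 "States a result or number not supported by the cited source."),
    ErrorPattern("meth-02", "methodological misinterpretation",
                 "Misstates a method's assumptions, scope, or applicability.",
                 audience={"senior"}),
    ErrorPattern("value-01", "value bias",
                 "Conflicts with disciplinary values such as reproducibility standards."),
]


def build_checklist(role, domain_categories=None):
    """Return checklist items relevant to a user's role and (optionally) domain."""
    return [
        p for p in SCHEMA
        if role in p.audience
        and (domain_categories is None or p.category in domain_categories)
    ]


if __name__ == "__main__":
    # A junior researcher sees only the patterns tagged for their level.
    for item in build_checklist(role="junior"):
        print(f"[{item.category}] {item.pattern_id}: {item.definition}")
```

A checklist UI would render each returned item as a per‑answer checkbox, so reviewers confirm specific error dimensions instead of accepting a single confidence score.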
Limitations & Future Work
- Domain scope: The study focused on a limited set of scientific disciplines; extending the taxonomy to fields with different epistemic cultures (e.g., humanities) may uncover new error patterns.
- Scalability of manual schema use: While the schema improves human detection, applying it at large scale still requires expert time; future work should explore semi‑automated classification, e.g., prompting LLMs to flag errors according to the schema (a minimal sketch follows this list).
- Dynamic LLM behavior: As models evolve, error distributions may shift; the taxonomy will need periodic re‑validation.
- Tool integration: The paper proposes design directions but does not present a concrete implementation; building and user‑testing a prototype evaluation UI is a natural next step.
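As a sketch of the semi‑automated classification mentioned above, the snippet below prompts an LLM with abbreviated category definitions and asks it to flag matching errors in an answer. This is a hedged illustration: `call_llm` is a placeholder for whatever completion API is available, and the category definitions are hypothetical stand‑ins for the paper's taxonomy.

```python
import json
from typing import Callable

# Abbreviated, hypothetical category definitions; the paper's full taxonomy
# spans seven categories and 20 patterns.
CATEGORY_DEFINITIONS = {
    "factual inaccuracy": "Claims that contradict or overstate the cited sources.",
    "methodological misinterpretation": "Misstates how a method works or when it applies.",
    "value bias": "Conflicts with disciplinary values such as reproducibility.",
}

PROMPT_TEMPLATE = """You are auditing an AI answer to a scholarly question.
Error categories:
{categories}

Question: {question}
Answer: {answer}

Return a JSON list of objects with keys "category", "excerpt", and "explanation"
for every error you find. Return [] if the answer looks clean."""


def flag_errors(question: str, answer: str,
                call_llm: Callable[[str], str]) -> list:
    """Ask an LLM to label errors in `answer` using the schema categories.

    `call_llm` is any function that takes a prompt string and returns the
    model's text response (e.g., a thin wrapper around a chat-completion API).
    """
    categories = "\n".join(f"- {name}: {desc}"
                           for name, desc in CATEGORY_DEFINITIONS.items())
    prompt = PROMPT_TEMPLATE.format(categories=categories,
                                    question=question, answer=answer)
    raw = call_llm(prompt)
    try:
        flagged = json.loads(raw)
    except json.JSONDecodeError:
        flagged = []  # model did not return valid JSON; treat as no usable flags
    return flagged if isinstance(flagged, list) else []
```

Flags produced this way would still need expert confirmation; the goal is to triage which answers deserve a closer schema‑guided review, not to replace it.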
Bottom line: By codifying how scientists actually spot LLM mistakes, this research gives developers a practical roadmap for building more trustworthy scholarly QA systems, moving the field beyond coarse single‑number metrics toward human‑centered reliability.
Authors
- Anna Martin-Boyle
- William Humphreys
- Martha Brown
- Cara Leckey
- Harmanpreet Kaur
Paper Information
- arXiv ID: 2602.21059v1
- Categories: cs.HC, cs.CL
- Published: February 24, 2026