[Paper] Eval Factsheets: A Structured Framework for Documenting AI Evaluations

Published: December 3, 2025 at 01:46 PM EST
3 min read

Source: arXiv - 2512.04062v1

Overview

The paper introduces Eval Factsheets, a structured documentation framework designed to bring the same rigor that “Datasheets” and “Model Cards” brought to datasets and models—now to AI evaluation practices. By formalizing how we record the who, what, how, and why of benchmark runs, the authors aim to curb the reproducibility crisis and make it easier for engineers and product teams to compare and trust evaluation results.

Key Contributions

  • A unified taxonomy that captures evaluation details across five dimensions: Context, Scope, Structure, Method, and Alignment.
  • A concrete questionnaire (mandatory + recommended fields) that can be attached to any benchmark or evaluation pipeline (a hypothetical sketch of such a record follows this list).
  • Case‑study validation on a variety of modern benchmarks—including traditional test sets and emerging “LLM‑as‑judge” setups—showing the framework’s flexibility.
  • Open‑source tooling (templates and examples) to lower the adoption barrier for research labs and industry teams.
  • Guidelines for integration with existing documentation standards, encouraging ecosystem‑wide consistency.
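
To make the questionnaire idea concrete, here is a minimal, hypothetical sketch of what a factsheet record might look like in code, organized around the five dimensions named above. The class and field names are illustrative assumptions, not the schema released with the paper.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of an Eval Factsheet record, organized around the
# paper's five dimensions. Field names are illustrative assumptions.

@dataclass
class EvalFactsheet:
    # Context: who ran the evaluation and when (mandatory)
    evaluator_identity: str
    evaluation_date: str
    # Scope: what system and data were evaluated (mandatory)
    system_under_test: str
    dataset_version: str
    # Structure: how the benchmark is organized (mandatory)
    task_format: str                    # e.g. "multiple choice", "LLM-as-judge"
    # Method: how scores were computed (mandatory)
    metric: str
    metric_aggregation_method: str      # e.g. "macro average over tasks"
    # Alignment: robustness, bias, and intended-use notes (recommended)
    known_biases: Optional[str] = None
    robustness_notes: Optional[str] = None

sheet = EvalFactsheet(
    evaluator_identity="internal eval team",
    evaluation_date="2025-12-01",
    system_under_test="example-model-v3",
    dataset_version="GLUE v1.0",
    task_format="multiple choice",
    metric="accuracy",
    metric_aggregation_method="macro average over tasks",
)
```

Mandatory fields map to required constructor arguments, while recommended fields default to None, mirroring the mandatory/recommended split described above.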

Methodology

  1. Taxonomy Design – The authors surveyed a broad set of AI evaluation practices (image classification, language model prompting, reinforcement‑learning‑from‑human‑feedback, etc.) and distilled common reporting needs into five high‑level categories.
  2. Questionnaire Development – For each category they drafted specific fields (e.g., “Evaluator identity”, “Dataset version”, “Metric aggregation method”) and classified them as mandatory (must‑have for reproducibility) or recommended (adds nuance).
  3. Iterative Validation – The questionnaire was applied to several public benchmarks (GLUE, ImageNet‑V2, HELM, etc.). Feedback from the authors of those benchmarks was used to refine wording and coverage.
  4. Tooling – A lightweight Markdown/JSON schema was released so teams can generate Eval Factsheets automatically from their CI pipelines or experiment‑tracking systems (see the sketch at the end of this section).

The approach is deliberately non‑technical: rather than prescribing new statistical techniques, it focuses on metadata—the “who‑did‑what‑when‑how” that makes an evaluation understandable to a downstream developer.
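
As a rough sketch of the kind of automation the tooling enables, the snippet below shows how a CI job might emit a factsheet as JSON, refusing to write one that is missing mandatory fields. The field list and file name are assumptions for illustration; the paper's released Markdown/JSON schema is not reproduced here.

```python
import json

# Mandatory fields assumed for illustration; the released schema defines the real list.
MANDATORY_FIELDS = {
    "evaluator_identity", "evaluation_date", "system_under_test",
    "dataset_version", "task_format", "metric", "metric_aggregation_method",
}

def emit_factsheet(record: dict, path: str) -> None:
    """Write a factsheet record to JSON, rejecting records with missing mandatory fields."""
    missing = sorted(name for name in MANDATORY_FIELDS if not record.get(name))
    if missing:
        raise ValueError(f"Factsheet incomplete; missing mandatory fields: {missing}")
    with open(path, "w") as f:
        json.dump(record, f, indent=2)

# Example: called at the end of a CI evaluation job, with values pulled from
# the experiment-tracking run rather than hard-coded.
emit_factsheet(
    {
        "evaluator_identity": "internal eval team",
        "evaluation_date": "2025-12-01",
        "system_under_test": "example-model-v3",
        "dataset_version": "GLUE v1.0",
        "task_format": "multiple choice",
        "metric": "accuracy",
        "metric_aggregation_method": "macro average over tasks",
    },
    "eval_factsheet.json",
)
```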

Results & Findings

  • Coverage: The factsheets captured all critical aspects of 12 diverse benchmarks, from simple accuracy tables to complex multi‑turn LLM‑as‑judge pipelines.
  • Consistency: When two independent teams documented the same benchmark, their factsheets matched on 94% of mandatory fields, demonstrating low ambiguity.
  • Reproducibility Boost: In a controlled replication study, providing the Eval Factsheet reduced the time to reproduce a benchmark’s results by ~30% compared with the original paper’s description alone.
  • Stakeholder Insight: Surveyed engineers reported higher confidence in selecting a benchmark for model comparison after reading its factsheet, citing clearer “Alignment” (robustness, bias) information.

Practical Implications

  • For ML Engineers: Plug an Eval Factsheet into your CI/CD pipeline; the generated document becomes a single source of truth for model evaluation, easing hand‑offs between data scientists, QA, and product owners (see the rendering sketch after this list).
  • For Product Managers: Quickly assess whether a benchmark aligns with product constraints (e.g., latency, fairness) without digging through dense methodology sections.
  • For Platform Builders: Incorporate the questionnaire schema into model‑hosting services (e.g., Hugging Face, Vertex AI) to surface evaluation provenance alongside model cards.
  • For Auditors & Regulators: A standardized factsheet simplifies compliance checks for AI transparency mandates, as the required “Alignment” fields map neatly to many emerging AI governance frameworks.
  • For Researchers: The framework encourages more thorough reporting, which can accelerate meta‑analysis and benchmark aggregation efforts (e.g., building a “benchmark zoo” with comparable metadata).
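
To illustrate the hand‑off mentioned in the first bullet, here is a hypothetical sketch that renders a factsheet JSON (such as the one emitted in the Methodology sketch) into a short Markdown summary that can ship as a build artifact alongside the model; the file names and formatting are assumptions, not part of the paper's tooling.

```python
import json

def render_factsheet_markdown(json_path: str, md_path: str) -> None:
    """Render an Eval Factsheet JSON record as a short Markdown summary."""
    with open(json_path) as f:
        record = json.load(f)
    lines = ["# Eval Factsheet", ""]
    for key, value in record.items():
        lines.append(f"- **{key.replace('_', ' ').title()}**: {value}")
    with open(md_path, "w") as f:
        f.write("\n".join(lines) + "\n")

# Example: run as a post-evaluation CI step so the summary travels with the build.
render_factsheet_markdown("eval_factsheet.json", "EVAL_FACTSHEET.md")
```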

Limitations & Future Work

  • Adoption Hurdles: The framework relies on voluntary compliance; without community or industry mandates, uptake may be uneven.
  • Granularity Trade‑off: Some highly specialized evaluations (e.g., neurosymbolic reasoning) may need extra fields beyond the current questionnaire, suggesting the need for extensible plug‑ins.
  • Automation Gaps: While tooling exists, fully automatic extraction of all mandatory fields (especially “Context” details like evaluator expertise) still requires manual input.
  • Future Directions: The authors plan to integrate Eval Factsheets with experiment‑tracking platforms (e.g., MLflow, Weights & Biases), develop a validation suite for factsheet completeness, and explore a community‑driven registry of benchmark factsheets for cross‑benchmark comparisons.

Authors

  • Florian Bordes
  • Candace Ross
  • Justine T Kao
  • Evangelia Spiliopoulou
  • Adina Williams

Paper Information

  • arXiv ID: 2512.04062v1
  • Categories: cs.LG
  • Published: December 3, 2025
