[Paper] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
Source: arXiv - 2512.10791v1
Overview
The paper presents The FACTS Leaderboard, a new, publicly‑available benchmark suite that measures how factually accurate large language models (LLMs) are across a wide range of real‑world tasks. By unifying four complementary sub‑benchmarks—multimodal QA, closed‑book knowledge, search‑augmented answering, and long‑form grounding—the authors aim to give developers a single, reliable score to compare models and track progress on factuality.
Key Contributions
- Unified factuality suite that aggregates performance on four distinct sub‑leaderboards, covering image‑based QA, parametric knowledge, retrieval‑augmented QA, and document‑grounded generation.
- Automated judging pipeline: each sub‑benchmark uses trained judge models (instead of costly human annotation) to score factuality at scale.
- Public leaderboard on Kaggle with both public and hidden test splits, enabling open competition while protecting against over‑fitting.
- Versioned Grounding benchmark (v2) with improved judge models that better detect hallucinations in long‑form text.
- Continuous maintenance plan: the suite will be updated with new data and tasks, encouraging long‑term community involvement.
Methodology
- Dataset Construction – The authors curated four task‑specific datasets:
- FACTS Multimodal: image‑question pairs requiring visual reasoning.
- FACTS Parametric: factoid questions that must be answered from the model’s internal knowledge (no external lookup).
- FACTS Search: open‑ended queries where the model can call a simulated search API and must synthesize retrieved snippets.
- FACTS Grounding (v2): long passages paired with source documents; the model must produce answers that are verifiable against the provided texts.
- Automated Judges – For each sub‑benchmark, a separate classifier (often a fine‑tuned LLM) predicts whether a response is factually correct. These judges were trained on a mixture of human‑annotated examples and synthetic perturbations to improve robustness.
- Scoring & Aggregation – Individual judge scores are averaged per sub‑benchmark, then the four averages are combined (simple mean) to produce the overall FACTS suite score. This design balances strengths and weaknesses across modalities and retrieval settings; a minimal sketch follows this list.
- Leaderboard Infrastructure – Submissions are evaluated on Kaggle’s platform. Public splits give immediate feedback; hidden splits ensure that final rankings reflect genuine generalization.
Results & Findings
- State‑of‑the‑art LLMs (e.g., GPT‑4, PaLM‑2) achieve high scores on Parametric and Search sub‑benchmarks but still lag behind on Multimodal and Grounding, indicating that visual reasoning and long‑form citation remain challenging.
- Retrieval‑augmented models outperform pure parametric ones on factuality, confirming that external knowledge sources can mitigate hallucinations when used correctly.
- The automated judges correlate strongly (Spearman ≈ 0.85) with human judgments on a held‑out validation set, suggesting that the scoring pipeline is reliable for large‑scale evaluation.
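The reported judge–human agreement is an ordinary rank-correlation check, which can be reproduced in spirit as follows. The sketch below uses invented placeholder scores (not the paper's validation data) to show how judge scores might be compared against human annotations with SciPy's `spearmanr`.

```python
from scipy.stats import spearmanr

# Placeholder values: per-response factuality scores from the automated judge
# and from human raters on the same held-out validation examples.
judge_scores = [0.9, 0.2, 0.7, 0.4, 0.95, 0.1, 0.6, 0.8]
human_scores = [1.0, 0.0, 0.8, 0.5, 1.00, 0.0, 0.4, 0.9]

rho, p_value = spearmanr(judge_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
# A rho near the paper's reported ~0.85 would indicate that the judges rank
# responses similarly to human annotators.
```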
Practical Implications
- Model selection: Developers can use the FACTS suite score as a single metric to choose a model that best fits their factuality requirements, rather than juggling multiple ad‑hoc tests.
- Product monitoring: Companies building chatbots, search assistants, or document‑analysis tools can integrate the benchmark into their CI pipelines to catch regressions in factual accuracy before release (a minimal sketch follows this list).
- Fine‑tuning guidance: The four sub‑benchmarks highlight specific weak spots (e.g., multimodal reasoning), helping teams target data collection or architectural changes where they matter most.
- Retrieval‑augmented design: The clear advantage of Search scores encourages the adoption of retrieval modules (e.g., RAG, tool‑use APIs) in production systems to improve answer grounding.
- Community standards: By providing a shared, continuously updated leaderboard, the research community gains a common yardstick, reducing fragmented evaluations and accelerating progress on hallucination mitigation.
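As one way to act on the product-monitoring point above, a team could gate releases on a minimum FACTS suite score. The sketch below is a hypothetical CI guard: the JSON file layout and the threshold values are assumptions for illustration, not part of the benchmark itself.

```python
import json
import sys
from statistics import mean

# Hypothetical thresholds a team might enforce before shipping a model update.
MIN_SUITE_SCORE = 0.70        # overall FACTS suite score
MIN_GROUNDING_SCORE = 0.65    # extra guard on a historically weak sub-benchmark

def main(path: str) -> int:
    # Assumed file layout: {"multimodal": 0.61, "parametric": 0.78, ...}
    with open(path) as f:
        sub_scores = json.load(f)

    suite = mean(sub_scores.values())
    print(f"FACTS suite score: {suite:.3f}  (per-task: {sub_scores})")

    if suite < MIN_SUITE_SCORE or sub_scores.get("grounding", 0.0) < MIN_GROUNDING_SCORE:
        print("Factuality regression detected; failing the build.")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```

Run it as the last step of an evaluation job, e.g. `python check_facts.py scores.json`, so the pipeline fails whenever a new model checkpoint drops below the agreed factuality bar.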
Limitations & Future Work
- Judge reliability: Although judges correlate well with humans, they can still be fooled by subtle factual errors or adversarial phrasing, so occasional human audits remain advisable.
- Domain coverage: The current datasets focus on general‑knowledge and English‑centric content; extending to specialized domains (medical, legal) and other languages is left for future releases.
- Static hidden splits: While hidden test sets protect against over‑fitting, they may become stale; the authors plan periodic refreshes to keep the benchmark challenging.
- Multimodal depth: The visual QA component currently uses single‑image questions; richer multimodal contexts (video, tables) are earmarked for upcoming versions.
The FACTS Leaderboard is now live on Kaggle (https://www.kaggle.com/benchmarks/google/facts). If you’re building LLM‑powered products, give it a spin and see where your model stands on factuality across the full spectrum of real‑world use cases.
Authors
- Aileen Cheng
- Alon Jacovi
- Amir Globerson
- Ben Golan
- Charles Kwong
- Chris Alberti
- Connie Tao
- Eyal Ben‑David
- Gaurav Singh Tomar
- Lukas Haas
- Yonatan Bitton
- Adam Bloniarz
- Aijun Bai
- Andrew Wang
- Anfal Siddiqui
- Arturo Bajuelos Castillo
- Aviel Atias
- Chang Liu
- Corey Fry
- Daniel Balle
- Deepanway Ghosal
- Doron Kukliansky
- Dror Marcus
- Elena Gribovskaya
- Eran Ofek
- Honglei Zhuang
- Itay Laish
- Jan Ackermann
- Lily Wang
- Meg Risdal
- Megan Barnes
- Michael Fink
- Mohamed Amin
- Moran Ambar
- Natan Potikha
- Nikita Gupta
- Nitzan Katz
- Noam Velan
- Ofir Roval
- Ori Ram
- Polina Zablotskaia
- Prathamesh Bang
- Priyanka Agrawal
- Rakesh Ghiya
- Sanjay Ganapathy
- Simon Baumgartner
- Sofia Erell
- Sushant Prakash
- Thibault Sellam
- Vikram Rao
- Xuanhui Wang
- Yaroslav Akulov
- Yulong Yang
- Zhen Yang
- Zhixin Lai
- Zhongru Wu
- Anca Dragan
- Avinatan Hassidim
- Fernando Pereira
- Slav Petrov
- Srinivasan Venkatachary
- Tulsee Doshi
- Yossi Matias
- Sasha Goldshtein
- Dipanjan Das
Paper Information
- arXiv ID: 2512.10791v1
- Categories: cs.CL, cs.AI
- Published: December 11, 2025