[Paper] Resources for Automated Evaluation of Assistive RAG Systems that Help Readers with News Trustworthiness Assessment

Published: February 27, 2026
4 min read
Source: arXiv - 2602.24277v1

Overview

The paper presents a new suite of resources for evaluating assistive Retrieval‑Augmented Generation (RAG) systems that help readers judge the trustworthiness of online news. Built around the TREC 2025 DRAGUN track, the authors release datasets, rubrics, and an automated judging tool that make it easy for researchers and developers to benchmark and improve such systems.

Key Contributions

  • Two reusable tasks:
    1. Question Generation – systems must output a ranked list of 10 investigative questions for a news article.
    2. Report Generation – systems must produce a concise (≈250‑word), well‑attributed report grounded in the MS MARCO V2.1 Segmented Corpus.
  • Human‑crafted importance‑weighted rubrics for 30 news articles, defining the “gold‑standard” information needed to assess article credibility.
  • AutoJudge, an automated evaluation pipeline that scores new system runs against the rubrics, achieving strong correlation with the original TREC human judgments (Kendall’s τ = 0.678 for questions, τ = 0.872 for reports).
  • Open‑source release of all data, rubrics, and evaluation code, enabling reproducible research and rapid prototyping of assistive news‑trust tools.

Methodology

  1. Task Design – Participants received a news article and were asked to (a) generate investigative questions that a skeptical reader would ask, and (b) synthesize a short report that cites evidence from a large passage collection (MS MARCO).
  2. Human Rubric Creation – TREC assessors read each article, identified the most critical facts for trust assessment, and wrote short answer expectations for each question. Each rubric entry carries an importance weight reflecting how vital the fact is.
  3. Manual Evaluation – For the original track, assessors compared system outputs to the rubrics, awarding scores based on relevance, correctness, and attribution.
  4. Automated Judging (AutoJudge) – The authors built a pipeline that:
    • Retrieves the expected short answers from the rubrics.
    • Uses a combination of lexical overlap, semantic similarity (via a pretrained language model), and citation matching to score system outputs.
    • Aggregates weighted scores to produce a final ranking.
  5. Correlation Analysis – They measured how well AutoJudge’s rankings matched the human rankings using Kendall’s τ, demonstrating that the automated metric is a reliable proxy for human assessment.
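The scoring step above combines lexical overlap, semantic similarity, and citation matching into importance-weighted scores. The paper's actual implementation is not reproduced here; the following is a minimal sketch of rubric-weighted scoring in which token-overlap F1 stands in for all three matchers, and every name (`token_f1`, `score_run`, the rubric fields) is hypothetical:

```python
# Sketch of importance-weighted rubric scoring (hypothetical names;
# the real AutoJudge also uses a pretrained LM and citation matching).

def token_f1(expected: str, output: str) -> float:
    """Token-overlap F1 between an expected short answer and a system output."""
    exp = set(expected.lower().split())
    out = set(output.lower().split())
    common = len(exp & out)
    if common == 0:
        return 0.0
    precision = common / len(out)
    recall = common / len(exp)
    return 2 * precision * recall / (precision + recall)

def score_run(rubric: list[dict], system_output: str) -> float:
    """Aggregate importance-weighted match scores into one run score."""
    total_weight = sum(entry["weight"] for entry in rubric)
    weighted = sum(
        entry["weight"] * token_f1(entry["expected"], system_output)
        for entry in rubric
    )
    return weighted / total_weight if total_weight else 0.0

# Toy rubric: higher-weight entries matter more for the final score.
rubric = [
    {"expected": "the study was not peer reviewed", "weight": 3.0},
    {"expected": "the author has a financial conflict", "weight": 1.0},
]
report = "The cited study was not peer reviewed according to reporters"
print(score_run(rubric, report))
```

A system output that covers the heavily weighted rubric entry scores well even if it misses a low-weight one, which is the point of importance weighting.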

Results & Findings

  • Question Generation: AutoJudge’s rankings correlated with human rankings at τ = 0.678, indicating solid agreement despite the open‑ended nature of question quality.
  • Report Generation: Correlation rose to τ = 0.872, showing that the automated metric captures the nuances of factual grounding and attribution very well.
  • Reusability: The released rubrics and AutoJudge can evaluate any new system without needing fresh human judgments, dramatically lowering the cost of iterative development.
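Kendall's τ, the metric behind these correlation figures, compares two rankings by counting concordant versus discordant pairs of systems. A pure-Python illustration of the tau-a variant on toy scores (not the paper's data, and ignoring ties, which the paper's tooling may handle differently):

```python
from itertools import combinations

def kendall_tau(a: list[float], b: list[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    a and b are scores for the same systems under two judges.
    """
    assert len(a) == len(b) and len(a) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:      # pair ordered the same way by both judges
            concordant += 1
        elif sign < 0:    # pair ordered oppositely
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs

human = [0.9, 0.7, 0.5, 0.3]   # human scores for four systems (toy data)
auto  = [0.8, 0.6, 0.65, 0.2]  # automated scores: one pair swapped
print(kendall_tau(human, auto))
```

With one of six pairs swapped, τ = (5 − 1)/6 ≈ 0.667; a τ of 0.872, as reported for report generation, means the automated and human rankings disagree on very few pairs.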

Practical Implications

  • Developer Toolkits – Teams building browser extensions, news‑aggregators, or AI assistants can plug in AutoJudge to automatically benchmark how well their RAG models surface critical trust‑related information.
  • Rapid Prototyping – Researchers can iterate on prompting strategies, retrieval pipelines, or citation mechanisms and get immediate, comparable feedback against a human‑validated baseline.
  • Industry Standards – Media platforms looking to add “trust‑score” overlays could adopt the rubric‑based evaluation as part of their quality‑control pipeline, ensuring that AI‑generated summaries are both factual and transparent.
  • Educational Use – Journalism schools can use the question‑generation task to teach students how to interrogate sources, while the report‑generation rubric serves as a checklist for fact‑checking workflows.

Limitations & Future Work

  • Scope of Rubrics – Only 30 articles were manually annotated, which may limit coverage of diverse topics, languages, and writing styles. Expanding the rubric set would improve generalizability.
  • Reliance on MS MARCO – Grounding reports on a single passage corpus could bias systems toward that source; future work should explore multi‑source grounding (e.g., fact‑checking databases, social‑media streams).
  • Semantic Evaluation Gaps – While AutoJudge correlates well with human scores, it still struggles with nuanced reasoning or detecting subtle bias; integrating more advanced reasoning models could close this gap.
  • User‑Centric Validation – The current evaluation focuses on assessor judgments. Field studies measuring actual reader trust after interacting with assistive RAG outputs would provide stronger evidence of real‑world impact.

Authors

  • Dake Zhang
  • Mark D. Smucker
  • Charles L. A. Clarke

Paper Information

  • arXiv ID: 2602.24277v1
  • Categories: cs.IR, cs.AI
  • Published: February 27, 2026
