[Paper] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

Published: November 26, 2025 at 08:51 AM EST
3 min read

Source: arXiv - 2511.21401v1

Overview

Misinformation often spreads in the comment sections of online news articles, and fact‑checkers need more than just a “yes/no” verdict—they need concrete evidence from reliable sources to back up or refute each claim. This paper tackles the fine‑grained evidence extraction problem for Czech and Slovak claims, building a new human‑annotated dataset and testing how well current large language models (LLMs) can reproduce the exact evidence spans that humans would select.

Key Contributions

  • New multilingual dataset: 2‑way (support/contradict) fine‑grained evidence annotations for Czech and Slovak claims, created by paid annotators and released for research.
  • Comprehensive LLM benchmark: Evaluation of eight open‑source LLMs ranging from 8B to 120B parameters on the evidence‑extraction task.
  • Error‑type analysis: Identification of the most common failure modes (e.g., paraphrasing instead of verbatim copying, missing spans, hallucinated evidence).
  • Insights on size vs. alignment: Demonstrates that a modest 8B model (llama3.1‑8B) can outperform a much larger model (gpt‑oss‑120B) when it comes to matching human‑selected evidence.
  • Practical guidance: Highlights which model families (Qwen‑3, DeepSeek‑R1, GPT‑OSS‑20B) strike the best balance between parameter count and evidence‑extraction quality.

Methodology

  1. Dataset construction

    • Collected real‑world claims from Czech and Slovak news‑article comment threads.
    • For each claim, retrieved a set of candidate documents (news articles, fact‑checking sites, etc.).
    • Paid annotators marked the exact text spans that directly support or refute the claim, producing a binary “support/contradict” label plus the span boundaries.
  2. Model prompting

    • Each LLM received a prompt containing the claim and the retrieved source document.
    • The task was phrased as: “Extract the exact sentence(s) that support or contradict the claim.”
    • No chain‑of‑thought or few‑shot examples were used, keeping the setup comparable across models (a minimal prompt sketch follows the list below).
  3. Evaluation metrics

    • Exact Match (EM): Does the model’s output exactly match the human‑annotated span?
    • F1 over token overlap: Allows partial credit when the model captures most of the span.
    • Invalid‑output rate: Percentage of cases where the model returns a paraphrase, a summary, or no span at all (a toy scoring sketch also follows the list below).
  4. Error analysis

    • Categorized mismatches into “copy‑error”, “span‑shift”, “hallucination”, and “no‑output”.
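
To make step 2 concrete, here is a minimal sketch of how such an extraction prompt could be assembled. The wording, the `build_extraction_prompt` name, and the claim/document formatting are illustrative assumptions, not the authors' exact prompt template.

```python
# Minimal sketch of an evidence-extraction prompt. The wording and formatting
# are illustrative assumptions, not the paper's actual prompt template.
def build_extraction_prompt(claim: str, document: str) -> str:
    return (
        "You are given a claim and a source document.\n"
        "Extract the exact sentence(s) from the document that support or "
        "contradict the claim. Copy them verbatim from the document; do not "
        "paraphrase or summarize.\n\n"
        f"Claim: {claim}\n\n"
        f"Document:\n{document}\n\n"
        "Evidence (verbatim):"
    )
```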

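The scoring sketch below illustrates how the three metrics from step 3 could be computed. Whitespace tokenization and the substring test for invalid outputs are simplifying assumptions; the paper's exact implementation may differ.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    # Strict comparison against the human-annotated span.
    return pred.strip() == gold.strip()

def token_f1(pred: str, gold: str) -> float:
    # Partial credit based on token overlap between prediction and gold span.
    pred_toks, gold_toks = pred.split(), gold.split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def is_invalid(pred: str, source_document: str) -> bool:
    # Empty outputs, or outputs that are not verbatim substrings of the source
    # (paraphrases, summaries), count toward the invalid-output rate.
    pred = pred.strip()
    return not pred or pred not in source_document
```
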
Results & Findings

Model (size)       Exact Match   F1     Invalid‑output rate
llama3.1‑8B        38 %          62 %   12 %
qwen3‑14B          35 %          60 %   14 %
deepseek‑r1‑32B    34 %          59 %   15 %
gpt‑oss‑20B        33 %          58 %   16 %
gpt‑oss‑120B       27 %          53 %   28 %

  • Copy fidelity matters: The biggest source of error was the model paraphrasing the evidence instead of copying it verbatim, which the evaluation penalized heavily.
  • Size isn’t everything: The 8 B llama3.1 model achieved the highest exact‑match rate, while the 120 B GPT‑OSS model suffered from a high invalid‑output rate, suggesting that alignment and training data quality outweigh sheer parameter count for this task.
  • Balanced performers: Qwen‑3‑14B, DeepSeek‑R1‑32B, and GPT‑OSS‑20B offered a sweet spot—reasonable exact‑match scores with relatively low invalid‑output rates.

Practical Implications

  • Fact‑checking pipelines: Integrating an LLM that reliably extracts verbatim evidence can automate the “evidence‑gathering” step, freeing human reviewers to focus on higher‑level reasoning.
  • Multilingual moderation tools: The dataset and findings show that effective evidence extraction is feasible for less‑resourced languages like Czech and Slovak, encouraging developers to extend moderation bots beyond English.
  • Model selection guidance: For teams building evidence‑based verification services, opting for a well‑aligned mid‑size model (e.g., Qwen‑3‑14B) may yield better ROI than deploying a massive, but poorly aligned, model.
  • Prompt engineering: The study underscores the need for prompts that explicitly request exact spans, possibly combined with post‑processing checks (e.g., string‑matching against the source) to filter out paraphrases, as sketched below.
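
As one example of such a post‑processing check, the sketch below accepts a model span only if it occurs verbatim in the source and otherwise tries to map it back to the closest source passage. The `verify_span` name and the 0.9 threshold are arbitrary illustrative choices, not part of the paper.

```python
from difflib import SequenceMatcher

def verify_span(span: str, source: str, min_ratio: float = 0.9) -> str | None:
    """Keep verbatim spans, repair near-verbatim ones, reject the rest."""
    span = span.strip()
    if span and span in source:
        return span  # verbatim copy of the source: keep as-is
    # Near-miss (e.g., light paraphrase): find the longest block shared with
    # the source and return the original source text instead of the rewording.
    match = SequenceMatcher(None, source, span).find_longest_match(
        0, len(source), 0, len(span)
    )
    if span and match.size / len(span) >= min_ratio:
        return source[match.a : match.a + match.size]
    return None  # likely hallucinated or heavily paraphrased: drop it
```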

Limitations & Future Work

  • Domain restriction: The dataset focuses on news‑article comments; performance may differ on social‑media posts, forums, or longer-form texts.
  • Evaluation bias: Exact‑match scoring penalizes legitimate paraphrases that preserve factual content, so the metric may under‑represent useful model outputs.
  • Model diversity: Only open‑source LLMs were tested; proprietary models (e.g., Claude, Gemini) could behave differently.

Future directions

  • Expand the dataset to cover additional Slavic languages and broader domains.
  • Explore hybrid approaches that combine LLM extraction with retrieval‑augmented generation to improve copy fidelity.
  • Develop training objectives that explicitly reward verbatim evidence copying, reducing the paraphrasing error mode.

Authors

  • Antonín Jarolím
  • Martin Fajčík
  • Lucia Makaiová

Paper Information

  • arXiv ID: 2511.21401v1
  • Categories: cs.CL
  • Published: November 26, 2025