[Paper] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?
Source: arXiv - 2511.21401v1
Overview
Misinformation often spreads in the comment sections of online news articles, and fact‑checkers need more than just a “yes/no” verdict—they need concrete evidence from reliable sources to back up or refute each claim. This paper tackles the fine‑grained evidence extraction problem for Czech and Slovak claims, building a new human‑annotated dataset and testing how well current large language models (LLMs) can reproduce the exact evidence spans that humans would select.
Key Contributions
- New multilingual dataset: 2‑way (support/contradict) fine‑grained evidence annotations for Czech and Slovak claims, created by paid annotators and released for research.
- Comprehensive LLM benchmark: Evaluation of eight open‑source LLMs ranging from 8 B to 120 B parameters on the evidence‑extraction task.
- Error‑type analysis: Identification of the most common failure modes (e.g., paraphrasing instead of verbatim copying, missing spans, hallucinated evidence).
- Insights on size vs. alignment: Demonstrates that a modest 8 B model (llama3.1‑8B) can outperform much larger models (gpt‑oss‑120B) when it comes to matching human‑selected evidence.
- Practical guidance: Highlights which model families (Qwen‑3, DeepSeek‑R1, GPT‑OSS‑20B) strike the best balance between parameter count and evidence‑extraction quality.
Methodology
Dataset construction
- Collected real‑world claims from Czech and Slovak news‑article comment threads.
- For each claim, retrieved a set of candidate documents (news articles, fact‑checking sites, etc.).
- Paid annotators marked the exact text spans that directly support or refute the claim, producing a binary “support/contradict” label plus the span boundaries.
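The released data schema is not reproduced in this summary; as a rough mental model, one annotated record might look like the sketch below, where every field name and value is an assumption for illustration only, not the dataset's actual format.

```python
# Illustrative record; all field names and values are made-up examples,
# not taken from the released dataset.
record = {
    "claim": "<claim text from a Czech or Slovak comment thread>",
    "document_id": "<id of the retrieved candidate document>",
    "label": "support",            # binary: "support" or "contradict"
    "evidence_spans": [
        {
            "start": 1042,         # character offset of the span in the document
            "end": 1189,
            "text": "<exact sentence(s) copied verbatim from the document>",
        }
    ],
}
```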
Model prompting
- Each LLM received a prompt containing the claim and the retrieved source document.
- The task was phrased as: “Extract the exact sentence(s) that support or contradict the claim.”
- No chain‑of‑thought or few‑shot examples were used, keeping the setup comparable across models.
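The exact prompt wording is not given beyond the sentence above, but a zero-shot prompt in that spirit could be assembled roughly as follows (the function name and phrasing are illustrative, not the paper's):

```python
def build_prompt(claim: str, document: str) -> str:
    # Zero-shot prompt sketch: no few-shot examples, no chain-of-thought,
    # matching the comparable setup described above.
    return (
        "Claim: " + claim + "\n\n"
        "Source document:\n" + document + "\n\n"
        "Extract the exact sentence(s) from the source document that support "
        "or contradict the claim. Copy them verbatim, without paraphrasing. "
        "If no such sentence exists, answer 'NONE'."
    )
```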
Evaluation metrics
- Exact Match (EM): Does the model’s output exactly match the human‑annotated span?
- F1 over token overlap: Allows partial credit when the model captures most of the span.
- Invalid‑output rate: Percentage of cases where the model returns a paraphrase, a summary, or no span at all.
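For concreteness, the two overlap metrics can be computed along these lines; this is a minimal sketch of standard exact match and token-level F1, not the paper's evaluation code.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    # Strict equality after collapsing whitespace; paraphrases score 0 here,
    # which is why copy fidelity matters so much for this metric.
    return " ".join(pred.split()) == " ".join(gold.split())

def token_f1(pred: str, gold: str) -> float:
    # Token-overlap F1: partial credit when the prediction covers most of the span.
    pred_tokens, gold_tokens = pred.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```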
Error analysis
- Categorized mismatches into “copy‑error”, “span‑shift”, “hallucination”, and “no‑output”.
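The decision rules behind these categories are not spelled out in this summary; a rough first-pass triage could look like the sketch below, with the caveat that separating a paraphrase ("copy-error") from a hallucination ultimately requires manual inspection.

```python
def triage_output(pred: str, gold: str, source_doc: str) -> str:
    # Heuristic triage sketch; the paper's actual categorisation may differ.
    norm = lambda s: " ".join(s.split())
    pred_n, gold_n, doc_n = norm(pred), norm(gold), norm(source_doc)
    if not pred_n:
        return "no-output"                    # nothing usable returned
    if pred_n == gold_n:
        return "correct"                      # matches the annotated span exactly
    if pred_n in doc_n:
        return "span-shift"                   # verbatim source text, wrong boundaries
    return "copy-error-or-hallucination"      # not verbatim: paraphrase or invented text
```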
Results & Findings
| Model (size) | Exact Match | Token‑overlap F1 | Invalid‑output rate |
|---|---|---|---|
| llama3.1‑8B | 38 % | 62 % | 12 % |
| qwen3‑14B | 35 % | 60 % | 14 % |
| deepseek‑r1‑32B | 34 % | 59 % | 15 % |
| gpt‑oss‑20B | 33 % | 58 % | 16 % |
| gpt‑oss‑120B | 27 % | 53 % | 28 % |
- Copy fidelity matters: The biggest source of error was the model paraphrasing the evidence instead of copying it verbatim, which the evaluation penalized heavily.
- Size isn’t everything: The 8 B llama3.1 model achieved the highest exact‑match rate, while the 120 B GPT‑OSS model suffered from a high invalid‑output rate, suggesting that alignment and training data quality outweigh sheer parameter count for this task.
- Balanced performers: Qwen‑3‑14B, DeepSeek‑R1‑32B, and GPT‑OSS‑20B offered a sweet spot—reasonable exact‑match scores with relatively low invalid‑output rates.
Practical Implications
- Fact‑checking pipelines: Integrating an LLM that reliably extracts verbatim evidence can automate the “evidence‑gathering” step, freeing human reviewers to focus on higher‑level reasoning.
- Multilingual moderation tools: The dataset and findings show that effective evidence extraction is feasible for less‑resourced languages like Czech and Slovak, encouraging developers to extend moderation bots beyond English.
- Model selection guidance: For teams building evidence‑based verification services, opting for a well‑aligned mid‑size model (e.g., Qwen‑3‑14B) may yield better ROI than deploying a massive, but poorly aligned, model.
- Prompt engineering: The study underscores the need for prompts that explicitly request exact spans, possibly combined with post‑processing checks (e.g., string‑matching against the source) to filter out paraphrases.
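A lightweight post‑processing check of the kind suggested above could be as simple as the following sketch (the function name and whitespace normalisation are my choices, not the paper's):

```python
def is_verbatim(candidate: str, source_doc: str) -> bool:
    # Accept the extracted span only if it occurs verbatim in the source
    # after collapsing whitespace; otherwise treat it as a paraphrase and
    # discard it or re-prompt the model.
    norm = lambda s: " ".join(s.split())
    return norm(candidate) in norm(source_doc)
```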
Limitations & Future Work
- Domain restriction: The dataset focuses on news‑article comments; performance may differ on social‑media posts, forums, or longer-form texts.
- Evaluation bias: Exact‑match scoring penalizes legitimate paraphrases that preserve factual content, so the metric may under‑represent useful model outputs.
- Model diversity: Only open‑source LLMs were tested; proprietary models (e.g., Claude, Gemini) could behave differently.
Future directions
- Expand the dataset to cover additional Slavic languages and broader domains.
- Explore hybrid approaches that combine LLM extraction with retrieval‑augmented generation to improve copy fidelity.
- Develop training objectives that explicitly reward verbatim evidence copying, reducing the paraphrasing error mode.
Authors
- Antonín Jarolím
- Martin Fajčík
- Lucia Makaiová
Paper Information
- arXiv ID: 2511.21401v1
- Categories: cs.CL
- Published: November 26, 2025