[Paper] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?
Source: arXiv - 2511.21401v1
Overview
Misinformation often spreads in the comment sections of online news articles, and fact‑checkers need more than just a “yes/no” verdict—they need concrete evidence from reliable sources to back up or refute each claim. This paper tackles the fine‑grained evidence extraction problem for Czech and Slovak claims, building a new human‑annotated dataset and testing how well current large language models (LLMs) can reproduce the exact evidence spans that humans would select.
Key Contributions
- New multilingual dataset: 2‑way (support/contradict) fine‑grained evidence annotations for Czech and Slovak claims, created by paid annotators and released for research.
- Comprehensive LLM benchmark: Evaluation of eight open‑source LLMs ranging from 8 B to 120 B parameters on the evidence‑extraction task.
- Error‑type analysis: Identification of the most common failure modes (e.g., paraphrasing instead of verbatim copying, missing spans, hallucinated evidence).
- Insights on size vs. alignment: Demonstrates that a modest 8 B model (llama3.1‑8B) can outperform much larger models (gpt‑oss‑120B) when it comes to matching human‑selected evidence.
- Practical guidance: Highlights which model families (Qwen‑3, DeepSeek‑R1, GPT‑OSS‑20B) strike the best balance between parameter count and evidence‑extraction quality.
Methodology
Dataset construction
- Collected real‑world claims from Czech and Slovak news‑article comment threads.
- For each claim, retrieved a set of candidate documents (news articles, fact‑checking sites, etc.).
- Paid annotators marked the exact text spans that directly support or refute the claim, producing a binary “support/contradict” label plus the span boundaries.
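The released data schema is not reproduced in this summary; as a rough mental model, one annotated record might look like the sketch below, where every field name and value is an assumption for illustration only, not the dataset's actual format.

```python
# Illustrative record; all field names and values are made-up examples,
# not taken from the released dataset.
record = {
    "claim": "<claim text from a Czech or Slovak comment thread>",
    "document_id": "<id of the retrieved candidate document>",
    "label": "support",            # binary: "support" or "contradict"
    "evidence_spans": [
        {
            "start": 1042,         # character offset of the span in the document
            "end": 1189,
            "text": "<exact sentence(s) copied verbatim from the document>",
        }
    ],
}
```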
Model prompting
- Each LLM received a prompt containing the claim and the retrieved source document.
- The task was phrased as: “Extract the exact sentence(s) that support or contradict the claim.”
- No chain‑of‑thought or few‑shot examples were used, keeping the setup comparable across models.
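The exact prompt wording is not given beyond the sentence above, but a zero-shot prompt in that spirit could be assembled roughly as follows (the function name and phrasing are illustrative, not the paper's):

```python
def build_prompt(claim: str, document: str) -> str:
    # Zero-shot prompt sketch: no few-shot examples, no chain-of-thought,
    # matching the comparable setup described above.
    return (
        "Claim: " + claim + "\n\n"
        "Source document:\n" + document + "\n\n"
        "Extract the exact sentence(s) from the source document that support "
        "or contradict the claim. Copy them verbatim, without paraphrasing. "
        "If no such sentence exists, answer 'NONE'."
    )
```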
Evaluation metrics
- Exact Match (EM): Does the model’s output exactly match the human‑annotated span?
- F1 over token overlap: Allows partial credit when the model captures most of the span.
- Invalid‑output rate: Percentage of cases where the model returns a paraphrase, a summary, or no span at all.
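For concreteness, the two overlap metrics can be computed along these lines; this is a minimal sketch of standard exact match and token-level F1, not the paper's evaluation code.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    # Strict equality after collapsing whitespace; paraphrases score 0 here,
    # which is why copy fidelity matters so much for this metric.
    return " ".join(pred.split()) == " ".join(gold.split())

def token_f1(pred: str, gold: str) -> float:
    # Token-overlap F1: partial credit when the prediction covers most of the span.
    pred_tokens, gold_tokens = pred.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```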
Error analysis
- Categorized mismatches into “copy‑error”, “span‑shift”, “hallucination”, and “no‑output”.
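The decision rules behind these categories are not spelled out in this summary; a rough first-pass triage could look like the sketch below, with the caveat that separating a paraphrase ("copy-error") from a hallucination ultimately requires manual inspection.

```python
def triage_output(pred: str, gold: str, source_doc: str) -> str:
    # Heuristic triage sketch; the paper's actual categorisation may differ.
    norm = lambda s: " ".join(s.split())
    pred_n, gold_n, doc_n = norm(pred), norm(gold), norm(source_doc)
    if not pred_n:
        return "no-output"                    # nothing usable returned
    if pred_n == gold_n:
        return "correct"                      # matches the annotated span exactly
    if pred_n in doc_n:
        return "span-shift"                   # verbatim source text, wrong boundaries
    return "copy-error-or-hallucination"      # not verbatim: paraphrase or invented text
```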
Results & Findings
| Model (size) | Exact Match | Token‑overlap F1 | Invalid‑output rate |
|---|---|---|---|
| llama3.1‑8B | 38 % | 62 % | 12 % |
| qwen3‑14B | 35 % | 60 % | 14 % |
| deepseek‑r1‑32B | 34 % | 59 % | 15 % |
| gpt‑oss‑20B | 33 % | 58 % | 16 % |
| gpt‑oss‑120B | 27 % | 53 % | 28 % |
- Copy fidelity matters: The biggest source of error was the model paraphrasing the evidence instead of copying it verbatim, which the evaluation penalized heavily.
- Size isn’t everything: The 8 B llama3.1 model achieved the highest exact‑match rate, while the 120 B GPT‑OSS model suffered from a high invalid‑output rate, suggesting that alignment and training data quality outweigh sheer parameter count for this task.
- Balanced performers: Qwen‑3‑14B, DeepSeek‑R1‑32B, and GPT‑OSS‑20B offered a sweet spot—reasonable exact‑match scores with relatively low invalid‑output rates.
Practical Implications
- Fact‑checking pipelines: Integrating an LLM that reliably extracts verbatim evidence can automate the “evidence‑gathering” step, freeing human reviewers to focus on higher‑level reasoning.
- Multilingual moderation tools: The dataset and findings show that effective evidence extraction is feasible for less‑resourced languages like Czech and Slovak, encouraging developers to extend moderation bots beyond English.
- Model selection guidance: For teams building evidence‑based verification services, opting for a well‑aligned mid‑size model (e.g., Qwen‑3‑14B) may yield better ROI than deploying a massive, but poorly aligned, model.
- Prompt engineering: The study underscores the need for prompts that explicitly request exact spans, possibly combined with post‑processing checks (e.g., string‑matching against the source) to filter out paraphrases.
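A lightweight post‑processing check of the kind suggested above could be as simple as the following sketch (the function name and whitespace normalisation are my choices, not the paper's):

```python
def is_verbatim(candidate: str, source_doc: str) -> bool:
    # Accept the extracted span only if it occurs verbatim in the source
    # after collapsing whitespace; otherwise treat it as a paraphrase and
    # discard it or re-prompt the model.
    norm = lambda s: " ".join(s.split())
    return norm(candidate) in norm(source_doc)
```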
Limitations & Future Work
- Domain restriction: The dataset focuses on news‑article comments; performance may differ on social‑media posts, forums, or longer-form texts.
- Evaluation bias: Exact‑match scoring penalizes legitimate paraphrases that preserve factual content, so the metric may under‑represent useful model outputs.
- Model diversity: Only open‑source LLMs were tested; proprietary models (e.g., Claude, Gemini) could behave differently.
Future directions
- Expand the dataset to cover additional Slavic languages and broader domains.
- Explore hybrid approaches that combine LLM extraction with retrieval‑augmented generation to improve copy fidelity.
- Develop training objectives that explicitly reward verbatim evidence copying, reducing the paraphrasing error mode.
Authors
- Antonín Jarolím
- Martin Fajčík
- Lucia Makaiová
Paper Information
- arXiv ID: 2511.21401v1
- Categories: cs.CL
- Published: November 26, 2025