[Paper] Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols

Published: December 12, 2025 at 09:50 AM EST
4 min read
Source: arXiv - 2512.11614v1

Overview

Retrieval‑augmented generation (RAG) systems combine a search component with a large language model (LLM) to produce answers grounded in external documents. However, most current pipelines treat the retrieved text as a “soft hint” rather than provable evidence, leading to hallucinations when the context is missing or misleading.
The paper “Bounding Hallucinations: Information‑Theoretic Guarantees for RAG Systems via Merlin‑Arthur Protocols” proposes a novel training regime that casts the whole RAG pipeline as an interactive proof system, giving the generator a principled way to accept only when the evidence truly supports its answer and to reject otherwise.

Key Contributions

  • Interactive‑Proof‑Style Supervision: Adapts the Merlin‑Arthur (M/A) protocol to RAG, where the generator (Arthur) learns from helpful evidence (Merlin) and adversarial, misleading evidence (Morgana).
  • Linear‑Time XAI Hook: Uses a fast explainability method to pinpoint the most influential evidence spans and to let Merlin/Morgana edit them on the fly during training.
  • Explained Information Fraction (EIF): A new metric that separates explanation fidelity from raw prediction error, normalizing mutual‑information guarantees against model capacity.
  • Retriever Boost via Hard Positives/Negatives: Generates automatic “hard” training examples for the retriever, improving recall and mean reciprocal rank (MRR) without human‑annotated unanswerable queries.
  • Empirical Validation: Shows consistent gains in groundedness, completeness, soundness, and reject behavior across three RAG benchmarks and two families of LLMs (small and large).

Methodology

  1. Set Up the Proof Game

    • Arthur = the LLM generator.
    • Merlin = a helper that supplies correct evidence snippets.
    • Morgana = an adversary that injects incorrect or irrelevant snippets.
  2. Evidence‑Focused XAI

    • A lightweight attribution technique (e.g., gradient‑based token importance) runs in linear time to identify which retrieved passages most affect Arthur’s answer (a minimal attribution sketch appears after this list).
    • Merlin can replace low‑impact tokens with more supportive text; Morgana can corrupt the high‑impact tokens to create a “hard” negative.
  3. Training Loop

    • Arthur receives a question together with a mixed bag of evidence of unknown provenance (some from Merlin, some from Morgana); a training‑step sketch appears after this list.
    • It is trained to:
      a) Answer when the evidence collectively supports a correct answer.
      b) Reject (output “I don’t know”) when the evidence is insufficient or contradictory.
      c) Ground its answer on the exact evidence spans identified by the XAI module.
  4. Evaluation Framework

    • Standard RAG metrics (accuracy, recall, MRR) are complemented with EIF, which quantifies how much of the mutual information between question, evidence, and answer is explained by the model’s attribution map (a formula‑level sketch of EIF follows this list).
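
The linear‑time XAI hook in step 2 is only described at a high level here; the sketch below shows one common way to realize such a hook, a single‑backward‑pass gradient‑times‑input attribution over the prompt for a Hugging Face‑style causal LM. The prompt format and the way answer tokens are located are illustrative assumptions, not the authors' implementation.

```python
import torch

def attribute_context_tokens(model, tokenizer, question, context, answer):
    """Gradient-times-input attribution over the prompt tokens (illustrative sketch).

    One forward and one backward pass, so the cost is linear in sequence
    length; this approximates, rather than reproduces, the paper's XAI hook.
    """
    prompt = f"Question: {question}\nEvidence: {context}\nAnswer: {answer}"
    enc = tokenizer(prompt, return_tensors="pt")
    input_ids = enc["input_ids"]

    # Embed tokens explicitly so gradients can flow to the embeddings.
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])

    # Log-probability of the answer tokens (assumed to be the last ones in the prompt).
    answer_len = len(tokenizer(answer, add_special_tokens=False)["input_ids"])
    logits = out.logits[0, -answer_len - 1:-1, :]
    targets = input_ids[0, -answer_len:]
    logprob = torch.log_softmax(logits, dim=-1).gather(1, targets[:, None]).sum()

    # Backpropagate once and score each input token by |gradient x embedding|.
    logprob.backward()
    scores = (embeds.grad * embeds).sum(dim=-1).abs()[0]
    return scores  # the caller slices out the evidence positions / top spans
```

Merlin and Morgana can then use the highest‑ and lowest‑scoring evidence spans to decide which tokens to support or corrupt, as described in step 2.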
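
Step 3 can be pictured as the training step below. It is a minimal sketch under assumed interfaces: `merlin_evidence`, `morgana_corrupt`, and `is_supported` are hypothetical stand‑ins for the paper's Merlin/Morgana agents and its supportedness check, not the authors' training code.

```python
import random
import torch

REJECT = "I don't know"

def training_step(arthur, tokenizer, optimizer, batch,
                  merlin_evidence, morgana_corrupt, is_supported):
    """One Merlin/Arthur/Morgana training step (illustrative sketch).

    Arthur sees a mixed bag of evidence and is supervised to answer only
    when that evidence still supports the gold answer, and to emit the
    reject string otherwise.
    """
    losses = []
    for question, gold_answer, passages in batch:
        # Mix helpful (Merlin) and adversarial (Morgana) evidence.
        evidence = [merlin_evidence(p, question) if random.random() < 0.5
                    else morgana_corrupt(p, question)
                    for p in passages]

        # Supervise with the answer only if the mixed evidence supports it.
        target = gold_answer if is_supported(evidence, gold_answer) else REJECT

        text = f"Question: {question}\nEvidence: {' '.join(evidence)}\nAnswer: {target}"
        enc = tokenizer(text, return_tensors="pt")
        # Causal LM loss over the whole sequence (masking the non-answer part
        # is omitted here for brevity).
        out = arthur(**enc, labels=enc["input_ids"])
        losses.append(out.loss)

    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```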
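
The Explained Information Fraction is described here only informally. One plausible reading of "how much of the mutual information between question, evidence, and answer is explained by the attribution map" is the ratio below; this is an assumption offered as a sketch, not the paper's exact definition.

```latex
% Assumption: Q is the question, D the full retrieved context, A the answer,
% and E the evidence spans selected by the attribution map. EIF is then the
% explained share of the conditional mutual information:
\mathrm{EIF} \;=\; \frac{I(A;\, E \mid Q)}{I(A;\, D \mid Q)} \;\in\; [0, 1]
```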

Results & Findings

| Dataset / Model | Metric | Baseline RAG | M/A‑trained RAG |
|---|---|---|---|
| HotpotQA (BERT‑based) | Groundedness | 68 % | 78 % (+10 pp) |
| NaturalQuestions (GPT‑2) | Reject rate (unanswerable) | 22 % | 35 % (+13 pp) |
| FiQA (LLaMA‑7B) | MRR | 0.41 | 0.48 (+0.07) |
| | Retriever recall | 71 % | 78 % (+7 pp) |
| | Explained Information Fraction (EIF) | 0.42 | 0.58 (+0.16) |

  • Reduced Hallucinations: The proportion of answers that contradicted the supplied evidence dropped by ~30 % across all benchmarks.
  • Better Reject Behavior: The model learned to say “I don’t know” when evidence was ambiguous, a capability that previously required hand‑crafted unanswerable examples.
  • Retriever Gains: By feeding the retriever automatically generated hard positives/negatives, its top‑k recall improved without extra annotation cost.

Practical Implications

  • More Trustworthy Assistants: Developers building chatbots, code assistants, or knowledge‑base Q&A can rely on the system to refuse to answer when the source material is insufficient, reducing the risk of misinformation.
  • Zero‑Shot Unanswerable Detection: No need to curate a separate “unanswerable” dataset; the M/A framework creates adversarial examples on the fly, saving annotation time and cost.
  • Plug‑and‑Play Retriever Upgrade: Existing retrievers can be fine‑tuned with the automatically generated hard examples, yielding immediate recall improvements (a contrastive fine‑tuning sketch follows this list).
  • Explainability‑Driven Debugging: Because the model’s answer is tied to specific evidence spans, developers can surface those spans in UI components, making it easier to audit and debug model behavior.
  • Scalable to Different Model Sizes: The approach works for both modest‑size (≈300 M) and large (≈7 B) LLMs, meaning startups and enterprises alike can adopt it without needing massive compute.
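
For the retriever upgrade above, a standard way to consume automatically mined hard positives and negatives is an InfoNCE‑style contrastive loss over a bi‑encoder. The sketch below assumes a generic PyTorch `encoder(texts) -> (batch, dim)` callable and is not the paper's training recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_step(encoder, optimizer, queries, positives, hard_negatives, tau=0.05):
    """One retriever fine-tuning step on mined hard examples (illustrative sketch).

    queries, positives, and hard_negatives are parallel lists of strings,
    e.g. produced automatically by the Merlin/Morgana pipeline.
    """
    q = F.normalize(encoder(queries), dim=-1)          # (B, d) query embeddings
    p = F.normalize(encoder(positives), dim=-1)        # (B, d) supporting passages
    n = F.normalize(encoder(hard_negatives), dim=-1)   # (B, d) misleading passages

    # Each query is scored against all positives (in-batch negatives) and all
    # mined hard negatives; its gold document is the i-th positive.
    docs = torch.cat([p, n], dim=0)                    # (2B, d)
    logits = q @ docs.t() / tau                        # (B, 2B)
    labels = torch.arange(q.size(0), device=logits.device)

    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```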

Limitations & Future Work

  • Linear‑Time XAI Approximation: The attribution method trades off some fidelity for speed; more precise (but slower) explainers could further tighten the EIF bound.
  • Benchmark Scope: Experiments focus on English QA datasets; cross‑lingual or multimodal retrieval (e.g., images, tables) remains untested.
  • Proof‑System Overhead: The adversarial training loop adds extra compute per epoch, which may be prohibitive for very large models without distributed training tricks.
  • Theoretical Guarantees vs. Real‑World Noise: The information‑theoretic guarantees assume well‑behaved retrieval distributions; noisy web‑scale corpora could weaken the soundness guarantees.

Future directions include extending the M/A protocol to multimodal retrieval, integrating stronger attribution methods, and exploring curriculum learning where the difficulty of Morgana’s attacks ramps up as Arthur improves.

Authors

  • Björn Deiseroth
  • Max Henning Höth
  • Kristian Kersting
  • Letitia Parcalabescu

Paper Information

  • arXiv ID: 2512.11614v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: December 12, 2025