[Paper] MEVER: Multi-Modal and Explainable Claim Verification with Graph-based Evidence Retrieval

Published: February 10, 2026
Source: arXiv (2602.10023v1)

Overview

The paper MEVER tackles a core challenge in automated fact‑checking: verifying claims that rely on both text and images (e.g., a caption describing a chart). It introduces a unified system that not only pulls the right multimodal evidence, but also decides whether a claim is true and generates a human‑readable explanation. By adding a new scientific‑domain benchmark (AIChartClaim), the authors show that their approach works beyond generic news data.

Key Contributions

  • Joint multimodal evidence retrieval using a two‑layer graph that links claims, textual snippets, and images, enabling image‑to‑text and text‑to‑image reasoning.
  • Token‑ and evidence‑level fusion architecture that combines claim embeddings with multimodal evidence representations for more accurate verification.
  • Explainable output via a “Fusion‑in‑Decoder” module that produces natural‑language rationales grounded in the retrieved evidence.
  • AIChartClaim dataset, a curated collection of AI‑research‑paper claims paired with chart images and supporting text, filling a gap in scientific claim‑verification resources.
  • Comprehensive evaluation demonstrating state‑of‑the‑art performance on both existing general‑domain benchmarks and the new scientific benchmark.

Methodology

  1. Graph Construction – For each claim, the system builds a bipartite graph: one side holds textual evidence (sentences, captions), the other side holds visual evidence (charts, figures). Edges are weighted by cross‑modal similarity scores computed with pretrained encoders (e.g., CLIP for image‑text alignment).
  2. Two‑Layer Retrieval
    • Layer 1: Retrieve a coarse set of candidate texts and images based on claim‑to‑evidence similarity.
    • Layer 2: Refine the candidate set by propagating relevance scores across the graph, allowing an image to boost related text and vice‑versa (image‑to‑text and text‑to‑image reasoning).
  3. Verification Fusion
    • Token‑level: The claim tokens are fused with tokenized evidence using cross‑attention, letting the model attend to the most informative words/pixels.
    • Evidence‑level: Whole‑sentence and whole‑image embeddings are aggregated (via gated attention) to produce a compact multimodal representation that feeds into a classifier (true/false).
  4. Explanation Generation – The decoder receives the fused multimodal context (the same embeddings used for verification) and generates a textual justification. The “Fusion‑in‑Decoder” design ensures the explanation is directly tied to the evidence that drove the decision.
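The two retrieval layers above can be sketched in plain NumPy, assuming the claim, text snippets, and images have already been embedded by pretrained encoders (e.g., BERT and CLIP). The function names, the single propagation round, and the `alpha` mixing weight are illustrative choices, not details taken from the paper:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def two_layer_retrieve(claim_emb, text_embs, image_embs, k=3, alpha=0.5):
    """Layer 1: coarse claim-to-evidence similarity scores.
    Layer 2: one round of cross-modal score propagation over the
    bipartite text-image graph, so a relevant image boosts related
    text snippets and vice versa."""
    # Layer 1: direct claim-to-evidence similarity
    text_scores = cosine_sim(claim_emb[None, :], text_embs)[0]
    image_scores = cosine_sim(claim_emb[None, :], image_embs)[0]

    # Edge weights of the bipartite graph: text-image similarity
    cross = cosine_sim(text_embs, image_embs)  # shape (n_text, n_image)

    # Layer 2: propagate relevance across modalities (normalized average)
    text_refined = alpha * text_scores + (1 - alpha) * (cross @ image_scores) / cross.shape[1]
    image_refined = alpha * image_scores + (1 - alpha) * (cross.T @ text_scores) / cross.shape[0]

    # Keep the top-k evidence pieces per modality
    return np.argsort(-text_refined)[:k], np.argsort(-image_refined)[:k]
```

In this sketch, refinement is a single weighted averaging step over the graph edges; the paper's actual propagation scheme may iterate or use learned edge weights.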

All components are trained end‑to‑end with a multi‑task loss (retrieval, verification, explanation), encouraging the model to align evidence selection with the final verdict and its rationale.
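A minimal sketch of such a weighted multi-task objective follows; the per-task weights and the softmax cross-entropy form of the verification term are assumptions for illustration, not the paper's exact loss:

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for the true/false verdict."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return -np.log(probs[label])

def multi_task_loss(ret_loss, ver_logits, ver_label, exp_loss, w=(1.0, 1.0, 1.0)):
    """L = w0 * L_retrieval + w1 * L_verification + w2 * L_explanation.
    ret_loss and exp_loss are assumed to be precomputed scalars from
    the retrieval and explanation-generation heads."""
    ver_loss = cross_entropy(ver_logits, ver_label)
    return w[0] * ret_loss + w[1] * ver_loss + w[2] * exp_loss
```

Gradients from all three terms flow into the shared encoders, which is what ties evidence selection to the verdict and its rationale.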

Results & Findings

| Dataset | Verification Accuracy ↑ | Explanation BLEU ↑ |
|---|---|---|
| FEVER‑MM (general) | 84.7% (vs. 78.3% prior) | 21.4 (vs. 16.9) |
| AIChartClaim (scientific) | 78.2% (vs. 70.1% prior) | 18.7 (vs. 13.5) |

  • The graph‑based retrieval improves recall of relevant multimodal evidence by ~12% over baseline TF‑IDF + CLIP retrieval.
  • Token‑level fusion yields a noticeable boost for claims that hinge on fine‑grained textual cues (e.g., “the trend line is upward”).
  • Explanation quality correlates strongly with verification accuracy, confirming that better evidence selection leads to more faithful rationales.

Ablation studies show that removing either the graph layer or the Fusion‑in‑Decoder drops performance by >5%, underscoring the importance of each module.

Practical Implications

  • Fact‑checking pipelines for AI research – Developers building tools to audit scientific papers (e.g., for reproducibility checks) can plug in MEVER’s retrieval and verification modules to automatically flag dubious chart‑based claims.
  • Content moderation on social platforms – When users share memes or infographics, MEVER can jointly analyze the caption and the image to detect misinformation, providing moderators with a concise justification.
  • Explainable AI for compliance – Enterprises needing audit trails (e.g., financial reporting) can use the generated explanations to satisfy regulatory requirements that demand “why” a claim was accepted or rejected.
  • Dataset creation – The AIChartClaim pipeline demonstrates a reproducible way to harvest claim‑evidence pairs from scientific PDFs, enabling other domains (medicine, climate) to build similar benchmarks.

Because the system is end‑to‑end trainable and relies on publicly available encoders (BERT, CLIP), developers can fine‑tune MEVER on domain‑specific corpora without rebuilding the whole architecture.
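As a concrete example of such a pluggable component, the evidence-level gated-attention aggregation from the methodology could be prototyped as a small standalone module. This NumPy sketch assumes precomputed evidence embeddings; `gate_w` and `gate_b` stand in for learned parameters:

```python
import numpy as np

def gated_fusion(evidence_embs, gate_w, gate_b=0.0):
    """Evidence-level fusion: a scalar sigmoid gate per evidence item
    weights its embedding, and the gated embeddings are averaged into
    one compact multimodal vector for the true/false classifier."""
    scores = evidence_embs @ gate_w + gate_b          # (n_evidence,)
    gates = 1.0 / (1.0 + np.exp(-scores))             # sigmoid gates in (0, 1)
    fused = (gates[:, None] * evidence_embs).sum(axis=0) / gates.sum()
    return fused
```

In a fine-tuning setting, only `gate_w` and `gate_b` (plus any downstream classifier weights) would need to be updated on the domain-specific corpus.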

Limitations & Future Work

  • Domain transfer – While AIChartClaim shows promising results in AI research, performance on highly specialized visual domains (e.g., medical imaging) remains untested.
  • Scalability of graph retrieval – The two‑layer graph grows quadratically with the number of candidate evidence pieces; approximate nearest‑neighbor tricks are needed for large‑scale deployments.
  • Explanation fidelity – BLEU scores improve, but human evaluations reveal occasional “hallucinated” rationales that mention evidence not actually retrieved.
  • Future directions suggested by the authors include:
    1. Integrating structured data (tables, code snippets) into the multimodal graph.
    2. Exploring contrastive training to further align explanations with evidence.
    3. Applying reinforcement learning to optimize the trade‑off between retrieval cost and verification accuracy.
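
One standard approximate-nearest-neighbor trick for the scalability issue above is random-hyperplane hashing, which restricts similarity computation to a hash bucket instead of all candidates. The bucketing scheme below is a generic illustration, not the authors' proposal:

```python
import numpy as np

def build_buckets(embs, planes):
    """Hash each embedding to a bit signature from random hyperplanes;
    nearby vectors tend to land on the same side of each plane."""
    signs = (embs @ planes.T) > 0                     # (n, n_planes)
    buckets = {}
    for i, row in enumerate(signs):
        key = tuple(bool(x) for x in row)
        buckets.setdefault(key, []).append(i)
    return buckets

def ann_candidates(query, planes, buckets):
    """Return only the evidence indices sharing the query's bucket,
    shrinking the candidate set from O(n) to one bucket."""
    key = tuple(bool(x) for x in (query @ planes.T) > 0)
    return buckets.get(key, [])
```

More planes give smaller buckets (faster, lower recall); production systems would use a dedicated ANN library rather than this sketch.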

Authors

  • Delvin Ce Zhang
  • Suhan Cui
  • Zhelin Chu
  • Xianren Zhang
  • Dongwon Lee

Paper Information

  • arXiv ID: 2602.10023v1
  • Categories: cs.CL
  • Published: February 10, 2026
