[Paper] SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models
Source: arXiv - 2602.07833v1
Overview
The paper introduces SPD‑Faith Bench, a diagnostic suite that probes whether multimodal large language models (MLLMs) actually ground their reasoning in the images they are given, rather than merely producing plausible‑sounding chains of thought. By focusing on fine‑grained visual differences, the authors expose systematic faithfulness gaps in state‑of‑the‑art models and propose SAGE, a lightweight, training‑free fix that improves visual grounding.
Key Contributions
- SPD‑Faith Bench: a benchmark built around “spot‑the‑difference” tasks that require explicit visual comparison, isolating faithfulness from language priors.
- Failure‑mode analysis: identification of two recurring problems in current MLLMs – perceptual blindness (the model ignores visual cues) and perception‑reasoning dissociation (the model’s reasoning drifts away from what it actually perceives).
- Diagnostic tooling: probing methods that trace the root causes to decaying visual attention across transformer layers and representation shifts in the residual stream.
- SAGE framework: a train‑free, inference‑time wrapper that calibrates visual evidence, re‑routes attention to image patches, and aligns the reasoning trace with the visual input.
- Open resources: benchmark data, evaluation scripts, and the SAGE code are released publicly.
Methodology
- Benchmark design – The authors curate image pairs that differ by subtle visual attributes (e.g., color of a button, presence of a small object). Each query asks the model to explain why two images are different, forcing the generation of a step‑by‑step visual comparison.
- Faithfulness measurement – Instead of only checking answer correctness, they compare the model’s reasoning trace against a gold‑standard chain that references the exact visual evidence. Discrepancies indicate unfaithful reasoning.
- Model probing – Using attention roll‑outs and residual‑stream analysis, they monitor how visual tokens are attended to across transformer layers while the model generates its chain of thought.
- SAGE (Self‑Attention Guided Evidence) – At inference time, SAGE injects a calibrated visual mask derived from the model’s own attention scores, amplifying truly relevant patches and suppressing noise before the reasoning module runs. No gradient updates or fine‑tuning are required.
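The calibration step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the top‑k rule, and the `keep_ratio`/`floor` thresholds are assumptions standing in for whatever calibration SAGE actually performs on the model's attention scores.

```python
import numpy as np

def sage_mask(attn_scores, keep_ratio=0.3, floor=0.1):
    """Illustrative SAGE-style calibration: keep the most-attended image
    patches at full weight and damp the rest toward `floor`.
    All names and thresholds here are assumptions, not the paper's exact rule."""
    scores = np.asarray(attn_scores, dtype=float)
    k = max(1, int(len(scores) * keep_ratio))
    top = np.argsort(scores)[-k:]        # indices of the k most-attended patches
    mask = np.full_like(scores, floor)   # suppress low-attention (noisy) patches
    mask[top] = 1.0                      # amplify the salient patches
    return mask

# Re-weight patch embeddings before the reasoning pass (toy shapes).
patches = np.random.default_rng(0).normal(size=(8, 4))  # 8 patches, dim 4
attn = np.array([0.02, 0.30, 0.05, 0.25, 0.03, 0.20, 0.10, 0.05])
calibrated = patches * sage_mask(attn)[:, None]
```

Because the mask is derived from the model's own attention scores at inference time, no gradient updates are needed, which matches the paper's train‑free claim.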
Results & Findings
| Model (baseline) | Accuracy on SPD‑Faith | Faithful‑Chain Score* |
|---|---|---|
| GPT‑4V (zero‑shot) | 68.2 % | 0.42 |
| LLaVA‑1.5‑13B | 61.5 % | 0.35 |
| MiniGPT‑4 | 55.8 % | 0.28 |
*Faithful‑Chain Score measures overlap between generated reasoning steps and the gold visual evidence (higher is better).
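A minimal way to compute such an overlap score might look like the sketch below. The substring‑matching rule and the example data are illustrative assumptions; the authors' exact metric is not reproduced in this summary.

```python
def faithful_chain_score(trace_steps, gold_evidence):
    """Illustrative overlap metric: the fraction of gold visual-evidence
    items that appear somewhere in the generated reasoning steps.
    Simple substring matching is an assumption, not the paper's formula."""
    trace_text = " ".join(step.lower() for step in trace_steps)
    hits = sum(1 for item in gold_evidence if item.lower() in trace_text)
    return hits / len(gold_evidence) if gold_evidence else 0.0

# Toy example: the trace cites two of the three gold evidence items.
trace = ["The left button is red.", "The right image adds a small icon."]
gold = ["button is red", "small icon", "missing label"]
score = faithful_chain_score(trace, gold)  # → 2/3
```

A real implementation would likely use softer matching (embeddings or entailment) rather than exact substrings, but the shape of the metric is the same: reasoning steps are scored against referenced visual evidence, not just the final answer.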
- Perceptual blindness: attention to the image quickly fades after the first few transformer layers, leading the model to rely on language priors.
- Perception‑reasoning dissociation: even when early layers attend to the right patches, later layers shift representations, causing the reasoning module to generate unrelated explanations.
- SAGE impact: applying SAGE raises the Faithful‑Chain Score by +0.18 on average across models, with modest gains in overall answer accuracy (+2–4 %). The improvement is achieved without any additional training data or compute.
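The perceptual‑blindness finding can be checked with a simple probe: measure, per layer, how much of the final text token's attention lands on visual tokens. The tensor layout `(heads, queries, keys)` and the slice of visual positions below are illustrative assumptions about how an MLLM exposes its attention maps.

```python
import numpy as np

def visual_attention_by_layer(attn_layers, visual_idx):
    """For each layer, the share of the last query token's attention mass
    that falls on visual-token positions (layout assumptions noted above)."""
    shares = []
    for attn in attn_layers:           # attn: (heads, queries, keys)
        last_q = attn[:, -1, :]        # attention row of the final query token
        shares.append(float(last_q[:, visual_idx].sum(axis=-1).mean()))
    return shares

# Toy check: a layer with uniform attention vs. one that ignores the image.
uniform = np.full((2, 3, 8), 1 / 8)        # 2 heads, 3 queries, 8 keys
text_only = np.zeros((2, 3, 8))
text_only[..., 4:] = 1 / 4                 # all mass on the 4 text tokens
print(visual_attention_by_layer([uniform, text_only], slice(0, 4)))  # → [0.5, 0.0]
```

A curve of these shares that decays toward zero across layers is exactly the "attention to the image quickly fades" signature the paper reports.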
Practical Implications
- More trustworthy AI assistants – Developers building visual chatbots (e.g., for e‑commerce or medical imaging) can integrate SAGE to ensure the model’s explanations truly reflect the image, reducing hallucinations that could mislead users.
- Debugging multimodal pipelines – The benchmark and probing tools give engineers a systematic way to spot where visual information is lost, guiding architecture tweaks (e.g., deeper visual encoders, better cross‑modal fusion).
- Regulatory compliance – For domains where explainability is mandated (finance, healthcare), a faithfulness metric like the one proposed helps satisfy audit requirements by proving that reasoning traces are grounded in observable data.
- Zero‑cost improvement – Since SAGE is train‑free, it can be dropped into existing inference services with minimal latency overhead, offering an immediate ROI for products already using MLLMs.
Limitations & Future Work
- Scope of visual differences – SPD‑Faith focuses on fine‑grained, deterministic changes; it does not cover high‑level semantic reasoning (e.g., scene understanding) where faithfulness may manifest differently.
- Model‑agnostic assumptions – SAGE relies on the presence of cross‑modal attention maps; models that fuse modalities earlier or use non‑transformer backbones may need adapted techniques.
- Scalability of probing – The detailed residual‑stream analysis is computationally heavy, limiting its use to research settings rather than large‑scale production monitoring.
- Future directions suggested by the authors include extending the benchmark to video, exploring automated faithfulness metrics that do not require gold chains, and integrating SAGE‑style calibration into training objectives for even stronger grounding.
Authors
- Weijiang Lv
- Yaoxuan Feng
- Xiaobo Xia
- Jiayu Wang
- Yan Jing
- Wenchao Chen
- Bo Chen
Paper Information
- arXiv ID: 2602.07833v1
- Categories: cs.CV, cs.AI, cs.CL
- Published: February 8, 2026