[Paper] Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts

Published: January 7, 2026 at 11:39 AM EST
4 min read
Source: arXiv - 2601.04073v1

Overview

The paper investigates why large multimodal models (LMMs) that reason over video with chain‑of‑thought (CoT) prompts often get stuck on a wrong textual inference and ignore contradictory visual cues—a failure the authors call textual inertia. By systematically injecting logical perturbations into the models’ reasoning chains, the authors expose how rarely the models self‑correct and propose a training‑free inference technique that forces the model to re‑ground its thoughts in the visual stream, dramatically reducing hallucination propagation.

Key Contributions

  • Identification of “textual inertia” – a systematic failure mode where an early textual hallucination drives the rest of the reasoning, overriding visual evidence.
  • LogicGraph Perturbation Protocol – a benchmark that programmatically inserts logical inconsistencies into CoT sequences to probe self‑reflection across a variety of LMM architectures (native reasoning vs. prompt‑driven).
  • Comprehensive evaluation – shows that fewer than 10 % of perturbed cases are self‑corrected, confirming that most models blindly follow the initial error.
  • Active Visual‑Context Refinement (AVCR) – a training‑free inference framework that (1) actively re‑grounds each reasoning step in the visual input and (2) adaptively refines the textual context to filter out noise.
  • Empirical gains – AVCR cuts hallucination propagation by up to ~45 % and improves overall reasoning accuracy on several video‑question answering benchmarks.

Methodology

  1. LogicGraph Construction – For each video‑question pair, the authors build a directed graph representing the logical flow of a CoT answer. Nodes are intermediate textual statements; edges encode dependencies.
  2. Perturbation Injection – They flip the truth value of selected nodes (e.g., “the cat is red” → “the cat is blue”) and propagate the change downstream, creating a conflict between the altered text and the visual evidence.
  3. Model Families Tested
    • Native‑reasoning LMMs (e.g., Flamingo‑V, Video‑ChatGPT) that generate CoT internally.
    • Prompt‑driven LMMs that receive an external CoT template via prompting.
  4. Self‑Reflection Measurement – After perturbation, the model’s final answer is examined to see if it detects and corrects the inconsistency.
  5. Active Visual‑Context Refinement – During inference, each CoT step triggers:
    • Visual Re‑grounding: The model extracts a fine‑grained visual feature map for the current claim and computes a consistency score.
    • Context Denoising: A lightweight transformer summarizes the reasoning history, down‑weighting statements flagged as inconsistent.
      This loop runs without any additional training data or parameter updates.
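Steps 1–2 above can be sketched in a few lines. The structures below (a dict-based graph, a toy negation table, breadth-first taint propagation) are illustrative stand-ins, not the paper's actual protocol or data format:

```python
# Minimal sketch of LogicGraph construction and perturbation injection.
# Nodes are intermediate CoT statements; edges encode logical dependencies.

def build_logic_graph(steps, deps):
    """steps: list of (node_id, statement); deps: list of (src, dst) edges."""
    return {"steps": dict(steps), "deps": deps}

# Illustrative truth-flip table (the paper's "the cat is red" -> "the cat is blue").
NEGATIONS = {"the cat is red": "the cat is blue"}

def perturb(graph, node):
    """Flip one node's claim, then mark every downstream node as tainted."""
    g = {"steps": dict(graph["steps"]), "deps": graph["deps"]}
    g["steps"][node] = NEGATIONS.get(g["steps"][node], "NOT " + g["steps"][node])
    tainted, frontier = set(), [node]
    while frontier:  # propagate the change along dependency edges
        cur = frontier.pop()
        for src, dst in g["deps"]:
            if src == cur and dst not in tainted:
                tainted.add(dst)
                frontier.append(dst)
    return g, tainted

graph = build_logic_graph(
    steps=[("s1", "the cat is red"),
           ("s2", "a red object moves left"),
           ("s3", "the cat moved left")],
    deps=[("s1", "s2"), ("s2", "s3")],
)
perturbed, tainted = perturb(graph, "s1")
print(perturbed["steps"]["s1"])  # the cat is blue
print(sorted(tainted))           # ['s2', 's3']
```

The tainted set identifies exactly which downstream statements now conflict with the visual evidence, which is what the self-reflection measurement in step 4 checks against.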

Results & Findings

| Model | Self‑Correction Rate (perturbed) | Accuracy Gain with AVCR |
| --- | --- | --- |
| Native LMM (Flamingo‑V) | 8 % | +12.3 % |
| Prompt‑driven LMM (Video‑ChatGPT) | 6 % | +10.7 % |
| Baseline (no AVCR) | – | – |

  • Hallucination Propagation: In >90 % of perturbed cases, the erroneous textual claim persisted to the final answer.
  • AVCR Effectiveness: The active visual check caught ~70 % of the injected conflicts, and the context refinement prevented the error from contaminating later steps.
  • Speed Overhead: AVCR adds roughly 0.3× to inference latency (about a 30 % slowdown), a modest trade‑off for the robustness gain.
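The AVCR loop described in Methodology can be sketched as a simple filter over reasoning steps. The scorer below is a toy stand-in for the paper's fine-grained visual grounding model (here, word overlap against precomputed frame tags), and the threshold is an assumed hyperparameter:

```python
# Sketch of training-free AVCR-style filtering: score each CoT step
# against the visual stream, keep consistent steps, flag the rest as noise.

def visual_consistency(claim, frame_tags):
    """Toy consistency score: fraction of claim words supported by frame tags."""
    words = set(claim.lower().split())
    return len(words & frame_tags) / max(len(words), 1)

def avcr_filter(cot_steps, frame_tags, threshold=0.3):
    """Split steps into (kept, flagged) based on their consistency scores."""
    kept, flagged = [], []
    for step in cot_steps:
        score = visual_consistency(step, frame_tags)
        (kept if score >= threshold else flagged).append((step, round(score, 2)))
    return kept, flagged

frame_tags = {"cat", "red", "sofa", "jumps"}        # stand-in visual evidence
steps = ["red cat jumps", "blue dog barks"]          # one grounded, one hallucinated
kept, flagged = avcr_filter(steps, frame_tags)
print(kept)     # [('red cat jumps', 1.0)]
print(flagged)  # [('blue dog barks', 0.0)]
```

In the actual method the flagged statements are down-weighted by the context-denoising transformer rather than dropped outright; the split here just makes the re-grounding decision visible.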

Practical Implications

  • More Reliable Video QA Systems: Deployments in surveillance, sports analytics, or e‑learning can now trust that a model won’t blindly follow a single mis‑detected object or event.
  • Debug‑Friendly AI Assistants: The visual re‑grounding step yields a confidence score per reasoning step, giving developers a diagnostic hook to surface where the model went off‑track.
  • Zero‑Shot Robustness: Since AVCR is training‑free, existing LMM services can be upgraded with a simple inference wrapper, avoiding costly fine‑tuning pipelines.
  • Cross‑Modal Consistency Checks: The protocol can be repurposed as a benchmark for any system that fuses language and vision, encouraging the community to build models that truly “look before they speak.”
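The "inference wrapper" upgrade path can be made concrete with a small sketch. Everything here is hypothetical (`generate_step`, `check_step`, and the toy model are illustrative names, not the paper's or any library's API); it only shows the shape of wrapping an existing step generator with a grounding check:

```python
# Hypothetical training-free wrapper: each generated reasoning step must
# pass a visual consistency check before it joins the running context.

def with_avcr(generate_step, check_step, question, max_steps=5):
    context = [question]
    for _ in range(max_steps):
        step = generate_step(context)
        if step is None:          # model signals it is done
            break
        if check_step(step):      # re-ground the step against the video
            context.append(step)  # keep consistent steps
        # inconsistent steps are simply dropped from the running context
    return context[1:]

# Toy components standing in for a real LMM and its grounding check.
script = iter(["the cat is red", "the cat is blue", None])
steps = with_avcr(
    generate_step=lambda ctx: next(script),
    check_step=lambda s: "red" in s,
    question="What color is the cat?",
)
print(steps)  # ['the cat is red']
```

Because the wrapper only intercepts the generate loop, no weights are touched, which is what makes the zero-shot upgrade claim plausible for deployed services.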

Limitations & Future Work

  • Scope of Perturbations: The LogicGraph protocol currently focuses on binary truth flips; more nuanced semantic drifts (e.g., subtle attribute changes) remain unexplored.
  • Visual Grounding Granularity: AVCR relies on pre‑extracted frame‑level features; applying it to high‑resolution, long‑form video may increase computational cost.
  • Generalization to Other Modalities: The study is limited to video‑text; extending the approach to audio‑visual or text‑to‑3D scenarios is an open avenue.
  • User‑Controlled Trade‑offs: Future work could expose a tunable “refinement aggressiveness” parameter, letting developers balance latency against robustness per application.

Bottom line: By shining a light on textual inertia and offering a lightweight, inference‑only fix, this work nudges large multimodal models a step closer to trustworthy, real‑world reasoning.

Authors

  • Zhihao Zhu
  • Jiafeng Liang
  • Shixin Jiang
  • Jinlan Fu
  • Ming Liu
  • Guanglu Sun
  • See‑Kiong Ng
  • Bing Qin

Paper Information

  • arXiv ID: 2601.04073v1
  • Categories: cs.CV, cs.AI, cs.CL
  • Published: January 7, 2026