[Paper] See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Published: December 26, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.22120v1

Overview

The paper “See Less, See Right: Bi‑directional Perceptual Shaping For Multimodal Reasoning” tackles a persistent problem in vision‑language models (VLMs): they often rely on coarse visual hints or even cheat by answering from text alone, which hurts performance on tasks that need fine‑grained visual evidence (e.g., reading a chart’s polyline). The authors introduce Bi‑directional Perceptual Shaping (BiPS), a training‑time technique that teaches a VLM where to look and what to ignore without adding any extra inference cost.

Key Contributions

  • Bidirectional visual guidance: Generates two complementary “views” of each image—one that preserves question‑relevant regions and another that ablates them—turning them into explicit “where‑to‑look” signals.
  • KL‑based consistency & separation losses: Uses Kullback‑Leibler divergence to (1) force the model’s perception of the original image to match the evidence‑preserving view (coarse coverage) and (2) push it away from the evidence‑ablated view (discouraging text‑only shortcuts).
  • Training‑only overhead: The shaping signals are only needed during training; at inference time the model runs exactly as a vanilla VLM, keeping latency low.
  • Strong empirical gains: Improves Qwen2.5‑VL‑7B by an average of 8.2% across eight multimodal reasoning benchmarks and shows robust out‑of‑domain generalization to unseen datasets and image modalities.
  • Domain‑agnostic design: Works without hand‑crafted visual detectors or task‑specific prompts, making it applicable to a wide range of vision‑language tasks.
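
Putting the two shaping terms together, the overall training objective has roughly the following three‑part form, where x is the original image, q the question, x_EPV and x_EAV the two views, and λ_cons, λ_sep are weighting coefficients. The exact KL directions and weights shown here are illustrative rather than taken from the paper:

$$\mathcal{L}_{\text{BiPS}} \;=\; \mathcal{L}_{\text{CE}} \;+\; \lambda_{\text{cons}}\,\mathrm{KL}\!\big(p_\theta(\cdot \mid x_{\text{EPV}}, q)\,\big\|\,p_\theta(\cdot \mid x, q)\big) \;-\; \lambda_{\text{sep}}\,\mathrm{KL}\!\big(p_\theta(\cdot \mid x_{\text{EAV}}, q)\,\big\|\,p_\theta(\cdot \mid x, q)\big)$$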

Methodology

  1. Create two masked views per training example
    • Evidence‑Preserving View (EPV): Keeps only the pixels that are likely to support the answer (identified via a lightweight saliency map conditioned on the question).
    • Evidence‑Ablated View (EAV): Masks out those same pixels, leaving the rest of the image.
  2. KL‑Consistency Loss
    • The model’s output distribution (e.g., token logits) on the original image is forced to be close to its output on the EPV. This encourages the model to attend to all relevant regions, even if they are coarse.
  3. KL‑Separation Loss
    • The output on the original image is pushed away from the output on the EAV. If the model can still answer correctly when the crucial visual evidence is removed, it likely relied on textual shortcuts; the loss penalizes this behavior.
  4. Joint Training
    • The standard VLM loss (e.g., cross‑entropy on the answer) is combined with the two KL terms. The total objective is optimized end‑to‑end; no extra modules are needed at test time.

The pipeline can be visualized as a “teacher” that shows the model a masked‑down view containing only the relevant evidence (EPV) and a hole‑punched view with that evidence removed (EAV), while the model learns to keep its predictions stable on the former and to let them diverge on the latter.
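
To make these steps concrete, below is a minimal PyTorch‑style sketch of the training objective. The `model(image, question_ids)` interface, the saliency input used to build the views, the hard 0/1 masking in `make_views`, the KL directions, and the loss weights `lambda_cons` / `lambda_sep` are all assumptions for illustration, not the paper’s exact implementation.

```python
# Minimal sketch of the BiPS objective: answer cross-entropy plus the two
# KL shaping terms. All interfaces and hyperparameters below are assumed.
import torch
import torch.nn.functional as F


def make_views(image, saliency, threshold=0.5):
    """Split one image into the two complementary BiPS views.

    image:    (B, C, H, W) pixel tensor
    saliency: (B, 1, H, W) question-conditioned relevance map in [0, 1]
    """
    keep = (saliency > threshold).float()   # 1 where the evidence is deemed relevant
    epv = image * keep                      # Evidence-Preserving View: relevant pixels only
    eav = image * (1.0 - keep)              # Evidence-Ablated View: those pixels removed
    return epv, eav


def bips_loss(model, image, question_ids, answer_ids, saliency,
              lambda_cons=1.0, lambda_sep=0.1):
    """Answer cross-entropy plus KL-consistency and KL-separation terms."""
    epv, eav = make_views(image, saliency)

    logits_orig = model(image, question_ids)            # (B, T, V); gradients flow here
    with torch.no_grad():                                # the two views act as fixed references
        p_epv = F.softmax(model(epv, question_ids), dim=-1)
        p_eav = F.softmax(model(eav, question_ids), dim=-1)

    log_p_orig = F.log_softmax(logits_orig, dim=-1)

    # 1) Standard answer loss on the original image.
    ce = F.cross_entropy(logits_orig.flatten(0, 1), answer_ids.flatten(),
                         ignore_index=-100)

    # 2) KL-consistency: keep the original-image distribution close to the EPV one.
    kl_cons = F.kl_div(log_p_orig, p_epv, reduction="batchmean")

    # 3) KL-separation: push the original-image distribution away from the EAV one
    #    (negated so that a larger divergence lowers the total loss).
    kl_sep = -F.kl_div(log_p_orig, p_eav, reduction="batchmean")

    return ce + lambda_cons * kl_cons + lambda_sep * kl_sep
```

In practice the negated separation term would need to be bounded (e.g., with a margin or clipping) so it cannot dominate training; the paper’s exact formulation handles this choice.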

Results & Findings

| Benchmark | Baseline (Qwen2.5‑VL‑7B) | + BiPS | Δ (↑) |
| --- | --- | --- | --- |
| VQA‑CP | 45.1 % | 52.3 % | +7.2 % |
| ChartQA | 38.4 % | 46.9 % | +8.5 % |
| DocVQA | 61.0 % | 68.1 % | +7.1 % |
| … (8 benchmarks total) | | | +8.2 % avg |

  • Fine‑grained reliance: Ablation studies show a 30 % drop in performance when the EAV is fed at test time, confirming the model truly depends on the masked evidence.
  • Out‑of‑domain robustness: When evaluated on unseen datasets (e.g., medical charts, satellite imagery), BiPS‑trained models retain > 75 % of their in‑domain gains, whereas the baseline degrades sharply.
  • Zero inference overhead: Latency and memory footprints remain identical to the vanilla model because the EPV/EAV masks are discarded after training.

Practical Implications

  • More trustworthy VLMs: Developers can deploy vision‑language assistants that are less likely to hallucinate answers based solely on textual cues, which is critical for compliance‑heavy sectors (finance, healthcare).
  • Cost‑effective scaling: Since BiPS adds no runtime cost, it can be applied to large‑scale models (e.g., 30B+ parameters) without inflating serving expenses.
  • Domain‑agnostic adaptation: Companies can fine‑tune existing VLMs on proprietary image corpora (e.g., engineering schematics, GIS maps) and obtain robust reasoning abilities without building custom visual detectors.
  • Improved UI/UX for multimodal tools: Chatbots that answer questions about charts, diagrams, or UI screenshots will provide more accurate, evidence‑backed responses, reducing user frustration and support tickets.

Limitations & Future Work

  • Saliency estimation quality: The current EPV/EAV generation relies on a simple question‑conditioned saliency map; noisy masks could misguide the KL losses.
  • Limited to classification‑style reasoning: The method is evaluated mainly on multiple‑choice or short‑answer VQA tasks; extending it to open‑ended generation (e.g., captioning) remains open.
  • Training overhead: Although inference is unchanged, creating two extra views and computing KL terms roughly doubles per‑step compute during fine‑tuning.
  • Future directions: The authors suggest exploring learned mask generators, integrating with diffusion‑based visual priors, and applying BiPS to multimodal retrieval or instruction‑following scenarios.

Authors

  • Shuoshuo Zhang
  • Yizhen Zhang
  • Jingjing Fu
  • Lei Song
  • Jiang Bian
  • Yujiu Yang
  • Rui Wang

Paper Information

  • arXiv ID: 2512.22120v1
  • Categories: cs.CV
  • Published: December 26, 2025