[Paper] Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Published: January 29, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2601.22150v1

Overview

The paper “Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions” investigates why large Vision‑Language Models (VLMs) often give the same answer to an illusion‑laden image even after the illusion is flipped—a change that humans notice instantly. By introducing a systematic probing framework (VI‑Probe), the authors tease apart whether VLMs are truly seeing the visual change or simply recalling a memorized pattern from their training data.

Key Contributions

  • VI‑Probe framework: A controllable suite of classic visual‑illusion stimuli with graded perturbations and matched non‑illusion controls, enabling fine‑grained probing of VLM behavior.
  • New evaluation metrics:
    • Polarity‑Flip Consistency (measures whether a model flips its answer when the illusion polarity is reversed).
    • Template Fixation Index (captures reliance on static textual templates).
    • Illusion Multiplier (normalizes illusion‑induced response changes against matched controls).
  • Comprehensive empirical study across multiple VLM families (GPT‑5, Claude‑Opus‑4.1, Qwen variants, etc.), revealing heterogeneous failure modes rather than a single “memory‑only” explanation.
  • Open‑source release of the dataset, code, and analysis scripts, encouraging reproducible probing of future VLMs.

Methodology

  1. Stimulus Design – The authors selected several classic visual‑illusion families (e.g., Müller‑Lyer, Kanizsa, and Rubin’s vase). For each illusion they generated three versions (see the stimulus sketch after this list):

    • Original (standard illusion).
    • Polarity‑flipped (the illusion’s cue is inverted, producing the opposite percept).
    • Control (same visual layout but without the illusion‑inducing element).

    The images are rendered at multiple contrast levels to create a graded perturbation spectrum.
  2. Prompting Protocol – Each image is fed to a VLM together with a short, fixed question (e.g., “What shape do you see?”). The same prompt is used across all three versions to isolate visual influence from language bias.

  3. Metric Computation (see the metric sketch after this list)

    • Polarity‑Flip Consistency = proportion of cases where the model’s answer flips when the illusion polarity flips.
    • Template Fixation Index = similarity between answers on illusion and control images (high values indicate reliance on a memorized textual template).
    • Illusion Multiplier = (response change on illusion) / (response change on control), quantifying visual sensitivity beyond baseline language drift.
  4. Model Suite – The study evaluates 9 state‑of‑the‑art VLMs, ranging from multimodal GPT‑5 to open‑source Qwen‑VL, covering both proprietary and academic systems.
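
As a rough illustration of step 1, the sketch below uses Pillow to render Müller‑Lyer stimuli in the three versions (original, polarity‑flipped, control) at a few contrast levels. It is not the authors' released generator; the geometry, the contrast mapping, and the file names are all assumptions made for this example.

```python
# Illustrative sketch only (not the paper's stimulus generator).
# Renders a Müller‑Lyer image with two equal-length shafts: fins point
# outward on one shaft and inward on the other in the "original" version,
# are swapped in the "flipped" version, and are removed in the "control".
from PIL import Image, ImageDraw

def render_stimulus(version: str, contrast: float, size=(420, 260)) -> Image.Image:
    """version: 'original', 'flipped', or 'control'; contrast in (0, 1]."""
    gray = (int(255 * (1.0 - contrast)),) * 3      # lower contrast = lighter lines
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    x0, x1, fin = 80, size[0] - 80, 28
    fin_dirs = {"original": (+1, -1), "flipped": (-1, +1), "control": (0, 0)}[version]
    for fin_dir, y in zip(fin_dirs, (size[1] // 3, 2 * size[1] // 3)):
        draw.line([(x0, y), (x1, y)], fill=gray, width=3)        # equal-length shafts
        if fin_dir:
            for x, end in ((x0, -1), (x1, +1)):                  # fins at both shaft ends
                dx = fin_dir * end * fin
                draw.line([(x, y), (x + dx, y - fin)], fill=gray, width=3)
                draw.line([(x, y), (x + dx, y + fin)], fill=gray, width=3)
    return img

if __name__ == "__main__":
    for level, contrast in enumerate((0.25, 0.5, 1.0), start=1):   # graded perturbation spectrum
        for version in ("original", "flipped", "control"):
            render_stimulus(version, contrast).save(f"muller_lyer_{version}_c{level}.png")
```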

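Once the per‑image answers from step 2 have been collected under the fixed prompt, the metrics in step 3 could be computed along the following lines. This is a minimal sketch under stated assumptions: answers are compared by normalized exact string match, the “response change on control” is taken from a matched control pair, and the data layout is invented for the example; the paper may operationalize these quantities differently.

```python
# Minimal metric sketch (assumed data layout, not the released evaluation code).
from dataclasses import dataclass
from typing import List

@dataclass
class Trial:
    original: str       # model answer on the standard illusion image
    flipped: str        # answer on the polarity-flipped version
    control: str        # answer on the matched non-illusion control
    control_pair: str   # answer on a control with the same low-level change but no illusion cue

def changed(a: str, b: str) -> bool:
    """Crude 'response change' signal: normalized exact-match comparison."""
    return a.strip().lower() != b.strip().lower()

def polarity_flip_consistency(trials: List[Trial]) -> float:
    # Proportion of trials where the answer flips when the illusion polarity flips.
    return sum(changed(t.original, t.flipped) for t in trials) / len(trials)

def template_fixation_index(trials: List[Trial]) -> float:
    # Similarity between illusion and control answers; high values suggest the
    # model reuses one memorized textual template regardless of the image.
    return sum(not changed(t.original, t.control) for t in trials) / len(trials)

def illusion_multiplier(trials: List[Trial]) -> float:
    # Response change on illusion images normalized by response change on
    # matched controls, so language-only drift does not count as visual sensitivity.
    illusion_change = sum(changed(t.original, t.flipped) for t in trials)
    control_change = sum(changed(t.control, t.control_pair) for t in trials)
    return illusion_change / max(control_change, 1)   # guard against division by zero
```
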
Results & Findings

Model              Polarity‑Flip Consistency    Template Fixation Index    Illusion Multiplier
GPT‑5              0.12 (low)                   0.84 (high)                0.15 (memory‑dominated)
Claude‑Opus‑4.1    0.48 (moderate)              0.62 (mixed)               0.55 (perception‑memory tug‑of‑war)
Qwen‑VL‑7B         0.71 (higher)                0.41 (more visual)         0.78 (visual‑processing limited)

  • No single failure mode: Some models (GPT‑5) largely ignore the visual flip, suggesting a memory override where a learned textual pattern dominates. Others (Claude‑Opus‑4.1) show a competition between visual cues and memorized templates, flipping answers only on higher‑contrast flips. Qwen variants react more to the visual change but still exhibit a ceiling effect, hinting at visual‑processing capacity limits.
  • Gradient sensitivity: Across all models, higher contrast (stronger illusion) yields higher Illusion Multipliers (see the per‑contrast sketch after this list), confirming that VLMs are not completely blind to visual changes but are far less sensitive than humans.
  • Control baseline: Even on control images (no illusion), models occasionally drift in their answers, underscoring the importance of normalizing against language‑only noise.
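
As a small follow‑up to the metric sketch above, gradient sensitivity can be checked by grouping trials by contrast level and recomputing the Illusion Multiplier per level; the `trials_by_contrast` mapping is an assumed structure for this example.

```python
# Sketch: per-contrast Illusion Multiplier, reusing illusion_multiplier() from the metric sketch.
# trials_by_contrast maps an assumed contrast level (e.g. 0.25) to a list of Trial objects.
def gradient_sensitivity(trials_by_contrast):
    return {level: illusion_multiplier(trials)
            for level, trials in sorted(trials_by_contrast.items())}
```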

Practical Implications

  • Reliability of VLM‑driven UI/UX – Applications that rely on VLMs for visual QA (e.g., accessibility tools that describe images) may produce stable but incorrect descriptions when faced with subtle visual cues or adversarial patterns.
  • Safety & Content Moderation – If a VLM can be “tricked” into ignoring visual changes, malicious actors could embed harmful visual signals that the model fails to notice, while textual prompts remain benign.
  • Model Debugging & Auditing – The VI‑Probe metrics give engineers concrete diagnostics to spot whether a model is over‑relying on language priors versus genuine visual perception, guiding targeted fine‑tuning or architectural changes.
  • Benchmark Design – The framework can be extended to other domains (e.g., medical imaging) where distinguishing perception from memorized patterns is critical.

Limitations & Future Work

  • Scope of Illusions – The study focuses on a handful of classic 2‑D illusions; more complex, real‑world visual ambiguities (e.g., lighting changes, occlusions) remain untested.
  • Prompt Diversity – Using a single fixed prompt isolates visual effects but does not capture how prompt engineering might mitigate or exacerbate memory bias.
  • Model Access – Some proprietary VLMs (e.g., GPT‑5) were evaluated via API with limited control over internal representations, potentially conflating inference‑time caching with true perception.
  • Future Directions – The authors suggest expanding VI‑Probe to video streams, integrating eye‑tracking data for human baselines, and exploring training‑time interventions (e.g., contrastive visual‑language objectives) to reduce template fixation.

Bottom line: This work shows that today’s large VLMs are still far from human‑like visual perception. By providing a systematic probing toolkit, the authors give developers a practical way to audit and improve the visual sensitivity of the models that increasingly power our apps.

Authors

  • Xiaoxiao Sun
  • Mingyang Li
  • Kun Yuan
  • Min Woo Sun
  • Mark Endo
  • Shengguang Wu
  • Changlin Li
  • Yuhui Zhang
  • Zeyu Wang
  • Serena Yeung‑Levy

Paper Information

  • arXiv ID: 2601.22150v1
  • Categories: cs.CV
  • Published: January 29, 2026