[Paper] When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
Source: arXiv - 2602.17659v1
Overview
Vision‑Language‑Action (VLA) models are the backbone of robots that can follow natural‑language commands, but they often “cheat” by relying on visual shortcuts learned from biased datasets. This paper introduces LIBERO‑CF, the first benchmark that deliberately flips language instructions while keeping the visual scene plausible, exposing how often VLAs ignore the spoken intent. The authors also propose a lightweight inference add‑on—Counterfactual Action Guidance (CAG)—that dramatically cuts these failures without retraining the underlying model.
Key Contributions
- LIBERO‑CF benchmark: a counterfactual test suite that pairs each robot scene with alternative, contradictory language commands, quantifying “language‑following accuracy.”
- Systematic diagnosis of state‑of‑the‑art VLAs, showing that counterfactual failures are widespread even in top‑performing models.
- Counterfactual Action Guidance (CAG): a dual‑branch, training‑free inference wrapper that compares a standard VLA policy with a language‑agnostic Vision‑Action (VA) policy to detect and suppress shortcut‑driven actions.
- Plug‑and‑play compatibility: CAG works with any existing VLA architecture or pretrained weights—no extra demonstrations, fine‑tuning, or architectural changes required.
- Extensive empirical validation on simulated LIBERO‑CF tasks and real‑world robot setups, reporting consistent gains in both language fidelity and overall task success.
Methodology
1. Counterfactual Benchmark Construction
- Start from the LIBERO robot manipulation suite (various object layouts, grasping/placing tasks).
- For each scene, generate an alternative natural‑language instruction that is plausible but contradicts the original goal (e.g., “pick up the red block” → “push the blue block”).
- Keep the visual observation unchanged, forcing the model to rely on language rather than visual frequency cues.
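The pairing step above can be sketched as follows. The `Episode` container and the `COUNTERFACTUALS` lookup table are illustrative assumptions only; the paper's actual instruction-generation procedure is summarized here, not specified.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Episode:
    observation: str   # stands in for the raw camera frame
    instruction: str   # natural-language command

# Hypothetical contradiction table: each original command maps to a
# plausible but contradictory alternative for the same scene.
COUNTERFACTUALS = {
    "pick up the red block": "push the blue block",
    "open the top drawer": "close the bottom drawer",
}

def make_counterfactual(ep: Episode) -> Episode:
    """Keep the visual scene unchanged; flip only the instruction."""
    return replace(ep, instruction=COUNTERFACTUALS[ep.instruction])

ep = Episode("frame_000.png", "pick up the red block")
cf = make_counterfactual(ep)
assert cf.observation == ep.observation   # same scene
assert cf.instruction != ep.instruction   # contradictory intent
```

Because the observation is untouched, a model that follows the original instruction on the counterfactual episode is demonstrably relying on visual frequency cues rather than language.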
2. Baseline VLA Evaluation
- Run several recent VLA models (e.g., CLIP‑based, Transformer‑based) on the original and counterfactual instructions.
- Measure two metrics:
  - π₀.₅ (language‑following accuracy) – proportion of actions that align with the given instruction.
  - Task success – whether the robot completes the intended manipulation.
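A minimal sketch of scoring both metrics over a set of evaluation trials. The per-trial boolean fields are an assumed bookkeeping format for illustration, not the paper's evaluation harness:

```python
def evaluate(trials):
    """Return (language-following accuracy, task success rate) as fractions.

    Each trial records whether the executed actions matched the given
    instruction and whether the commanded manipulation was completed.
    """
    n = len(trials)
    lf = sum(t["followed_language"] for t in trials) / n
    ts = sum(t["task_success"] for t in trials) / n
    return lf, ts

trials = [
    {"followed_language": True,  "task_success": True},
    {"followed_language": False, "task_success": False},  # visual shortcut: acted on the wrong object
    {"followed_language": True,  "task_success": False},  # right intent, failed grasp
    {"followed_language": True,  "task_success": True},
]
lf, ts = evaluate(trials)   # lf = 0.75, ts = 0.5
```

Keeping the two metrics separate is the point of the benchmark: a shortcut-driven policy can score well on task success for original instructions while its language-following accuracy collapses under counterfactual ones.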
3. Counterfactual Action Guidance (CAG)
- Dual‑branch inference:
- VLA branch – the standard policy that conditions on both vision and language.
- VA branch – a language‑unconditioned vision‑only policy that predicts the most “habitual” action given the scene.
- Counterfactual comparison: at each decision step, compute the action distributions from both branches. When the VA branch is highly confident in a habitual action (i.e., a likely visual shortcut), CAG down‑weights that action in the VLA's distribution and selects the next‑best, more language‑consistent VLA action.
- No extra training data; the VA model can be a frozen checkpoint or even a simple heuristic controller.
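The comparison step can be sketched as below. The confidence threshold, the exponential down-weighting rule, and the function signature are all assumptions made for illustration; the paper's exact combination rule may differ.

```python
import math

def cag_select(vla_probs, va_probs, va_conf=0.6, alpha=1.0):
    """Pick an action index, suppressing likely visual shortcuts.

    vla_probs: action distribution from the vision+language (VLA) branch.
    va_probs:  action distribution from the language-agnostic (VA) branch.
    If the VA branch is confident in a "habitual" action, down-weight that
    action in the VLA distribution before taking the argmax.
    """
    habitual = max(range(len(va_probs)), key=va_probs.__getitem__)
    adjusted = list(vla_probs)
    if va_probs[habitual] > va_conf:
        # Suppress the shortcut action in proportion to VA confidence.
        adjusted[habitual] *= math.exp(-alpha * va_probs[habitual])
        total = sum(adjusted)
        adjusted = [p / total for p in adjusted]
    return max(range(len(adjusted)), key=adjusted.__getitem__)

vla = [0.50, 0.45, 0.05]    # VLA narrowly prefers action 0
va  = [0.80, 0.10, 0.10]    # vision-only habit strongly prefers action 0
print(cag_select(vla, va))  # selects action 1, overriding the habitual 0
```

When the VA branch is not confident (its top probability falls below the threshold), the wrapper leaves the VLA's choice untouched, which is what makes it a plug-in sanity check rather than a replacement policy.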
4. Integration & Evaluation
- Plug CAG into each VLA’s inference pipeline.
- Test on LIBERO‑CF and on a handful of real‑world robot setups (e.g., tabletop pick‑and‑place with a Franka arm).
Results & Findings
| Model | Baseline π₀.₅ | CAG (training‑free) π₀.₅ | CAG + VA π₀.₅ | Baseline Success | CAG (training‑free) Success | CAG + VA Success |
|---|---|---|---|---|---|---|
| VLA‑A | 62.1 % | 71.8 % (+9.7 %) | 77.6 % (+15.5 %) | 68.3 % | 71.9 % (+3.6 %) | 76.8 % (+8.5 %) |
- Counterfactual failures were observed in >40 % of under‑observed tasks across all baselines.
- Training‑free CAG alone yields a substantial lift in language‑following accuracy, indicating that many failures stem from inference‑time shortcut bias rather than limited model capacity.
- Adding a modest VA module (trained only on visual demonstrations) pushes the gains even higher.
- Real‑world tests: average counterfactual failure rate dropped from 9.4 % to 2.1 %, and overall task success rose by 17.2 %.
The takeaway: a simple inference‑time sanity check can dramatically improve a robot’s obedience to language, without costly data collection or model redesign.
Practical Implications
- Plug‑and‑play safety layer: Developers can wrap any existing VLA with CAG to add a “language sanity check,” reducing the risk of robots acting on the wrong object—a critical safety concern for collaborative robots (cobots).
- Cost‑effective robustness: Since CAG needs no extra demonstrations, teams can improve deployed systems without expanding their data pipelines.
- Debugging tool: The dual‑branch output highlights when a model is leaning on visual shortcuts, giving engineers actionable insight into dataset bias.
- Transfer to other modalities: The same counterfactual‑comparison idea could be applied to multimodal assistants (e.g., vision‑language chatbots) to guard against hallucinated actions.
- Benchmark adoption: LIBERO‑CF offers a ready‑made stress test for any VLA product before release, ensuring language compliance under ambiguous visual conditions.
Limitations & Future Work
- Scope of counterfactuals: LIBERO‑CF focuses on object‑centric manipulation; more complex tasks (e.g., tool use, multi‑step recipes) remain untested.
- VA quality dependence: While CAG works with a frozen vision‑only policy, its effectiveness scales with how well the VA captures common shortcuts; poorly trained VA models could introduce noise.
- Latency overhead: Running two inference branches doubles compute at runtime, which may be prohibitive for ultra‑low‑latency edge robots. Optimizations (e.g., shared visual encoder) are left for future engineering.
- Theoretical guarantees: The paper provides empirical evidence but no formal bound on how much counterfactual bias can be eliminated. A deeper analysis of the underlying distributional shift is an open research direction.
Overall, the work shines a light on a hidden failure mode in robot language grounding and offers a pragmatic, immediately usable fix—making VLAs safer and more trustworthy for real‑world deployment.
Authors
- Yu Fang
- Yuchun Feng
- Dong Jing
- Jiaqi Liu
- Yue Yang
- Zhenyu Wei
- Daniel Szafir
- Mingyu Ding
Paper Information
- arXiv ID: 2602.17659v1
- Categories: cs.CV, cs.RO
- Published: February 19, 2026