[Paper] Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

Published: November 26, 2025 at 01:37 PM EST
3 min read
Source: arXiv - 2511.21663v1

Overview

Vision‑Language‑Action (VLA) models are the backbone of many embodied AI systems, from household robots to AR assistants. This paper introduces ADVLA, a lightweight adversarial attack that perturbs the visual features after they have been projected into the language space, achieving near‑perfect disruption of downstream actions while keeping the visual changes tiny and highly localized.

Key Contributions

  • Feature‑space attack: Instead of modifying raw pixels, ADVLA injects perturbations directly into the visual encoder’s output that is fed to the language module.
  • Attention‑guided sparsity: Uses the model’s own attention maps to focus perturbations on the most influential patches, reducing the modified area to < 10 % of the image.
  • Three complementary strategies (a code sketch follows this list):
    1. Sensitivity amplification – boosts gradients on high‑impact features.
    2. Sparse masking (Top‑K) – limits perturbations to the top‑K attended patches.
    3. Concentration regularization – encourages the perturbation mass to cluster on critical regions.
  • Efficiency: A single‑step attack runs in ~0.06 s per image, roughly 7× faster than the conventional patch‑based baseline reported in the results table (0.4 s).
  • Strong empirical results: Under an $L_{\infty}=4/255$ budget, ADVLA attains ≈ 100 % attack success with barely perceptible changes.
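
The paper's exact formulation is not reproduced here, but a minimal PyTorch‑style sketch of the second and third strategies might look as follows; the tensor shapes, the source of the attention scores, and the weighting used in the concentration term are assumptions for illustration, not the authors' implementation.

```python
import torch

def topk_patch_mask(attn_scores: torch.Tensor, k_ratio: float = 0.1) -> torch.Tensor:
    """attn_scores: (num_patches,) cross-modal attention per image patch.
    Returns a {0, 1} mask that keeps only the top-k attended patches."""
    k = max(1, int(k_ratio * attn_scores.numel()))
    mask = torch.zeros_like(attn_scores)
    mask[attn_scores.topk(k).indices] = 1.0
    return mask

def concentration_penalty(delta: torch.Tensor, attn_scores: torch.Tensor) -> torch.Tensor:
    """delta: (num_patches, feat_dim) perturbation on the projected features.
    Penalizes perturbation energy on weakly attended patches, so minimizing
    this term pushes the perturbation mass toward the critical regions."""
    per_patch_energy = delta.pow(2).sum(dim=-1)               # (num_patches,)
    weights = 1.0 - attn_scores / (attn_scores.max() + 1e-8)  # low attention -> high weight
    return (weights * per_patch_energy).sum()
```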

Methodology

  1. Feature extraction – The visual encoder processes an input frame and produces a set of patch embeddings.
  2. Projection to language space – These embeddings are linearly projected into the textual feature space that the language model consumes.
  3. Gradient‑based perturbation – ADVLA computes the gradient of the downstream action loss w.r.t. the projected features.
  4. Attention guidance – The model’s cross‑modal attention scores identify which patches most influence the action prediction.
  5. Sparse masking – Only the top‑K patches (e.g., 5 %–10 % of all patches) are allowed to receive perturbations.
  6. Optimization – A single‑step (or few‑step) update applies the perturbation, clipped to the $L_{\infty}=4/255$ bound.

The whole pipeline bypasses any end‑to‑end retraining of the VLA model, making it a “plug‑and‑play” attack.
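
As a rough illustration of steps 3–6, here is a minimal single‑step sketch under stated assumptions: an FGSM‑style signed‑gradient update, a hypothetical `vla_language_head` / `action_loss` pair standing in for the frozen VLA components, and externally supplied per‑patch attention scores.

```python
import torch

EPS = 4 / 255  # L_inf budget reported in the paper

def advla_single_step(proj_feats, attn_scores, vla_language_head, action_loss,
                      targets, k_ratio=0.1):
    """proj_feats: (num_patches, feat_dim) visual features after projection into
    the language space; attn_scores: (num_patches,) cross-modal attention."""
    delta = torch.zeros_like(proj_feats, requires_grad=True)

    # Gradient of the downstream action loss w.r.t. the projected features
    # (the frozen language/action head is used only for the forward/backward pass).
    loss = action_loss(vla_language_head(proj_feats + delta), targets)
    loss.backward()

    # Attention-guided sparsity: perturb only the top-k attended patches.
    k = max(1, int(k_ratio * attn_scores.numel()))
    mask = torch.zeros_like(attn_scores)
    mask[attn_scores.topk(k).indices] = 1.0

    # Signed-gradient step, masked per patch; magnitude EPS keeps it inside the L_inf ball.
    step = EPS * delta.grad.sign() * mask.unsqueeze(-1)
    return (proj_feats + step).detach()
```

Because the signed step has magnitude exactly EPS on the selected patches and zero elsewhere, the $L_{\infty}=4/255$ constraint holds by construction; a few‑step variant would instead accumulate smaller steps and clamp the perturbation to $[-\epsilon, \epsilon]$ after each iteration.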

Results & Findings

| Metric | Baseline Patch Attack | ADVLA (Top‑K) |
| --- | --- | --- |
| Attack Success Rate | ~85 % | ≈ 100 % |
| Modified Patch Ratio | 30 %–40 % | < 10 % |
| Visual Distortion (PSNR, higher = less distortion) | 22 dB | > 30 dB (near‑imperceptible) |
| Runtime per Image | 0.4 s | 0.06 s |

  • Perturbations concentrate on semantically important regions (e.g., objects the robot is supposed to interact with).
  • Even with the strict $L_{\infty}=4/255$ constraint, the downstream policy’s action logits are flipped, demonstrating high sensitivity of VLA pipelines to feature‑space noise.
  • Ablation studies confirm that each of the three strategies (sensitivity, sparsity, concentration) contributes additively to the attack’s potency.

Practical Implications

  • Security testing for embodied AI – Developers can use ADVLA as a fast, low‑cost sanity check to evaluate the robustness of their VLA pipelines before deployment.
  • Design of defenses – The fact that tiny, sparse perturbations in feature space are enough to break the system suggests that future defenses should monitor attention‑weighted feature stability, not just pixel‑level anomalies.
  • Resource‑constrained environments – Because ADVLA runs in milliseconds on a single GPU, it can be integrated into continuous integration (CI) pipelines or on‑device testing suites (a hypothetical check is sketched after this list).
  • Insight for model architects – The attack highlights that the projection layer from vision to language is a critical vulnerability; adding stochasticity or regularization there could improve resilience.
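
As a concrete, purely hypothetical example of the CI use case above, a robustness gate could compare clean and attacked action predictions; `predict_actions`, `predict_actions_from_features`, and the agreement threshold below are illustrative placeholders, not part of the paper.

```python
import torch

AGREEMENT_THRESHOLD = 0.9  # illustrative gate, not a value from the paper

def test_feature_space_robustness(model, batch, attack_fn):
    """Fails the build if an ADVLA-style attack flips too many predicted actions."""
    with torch.no_grad():
        clean_actions = model.predict_actions(batch)  # hypothetical API
    adv_feats = attack_fn(model, batch)               # perturbed projected features
    with torch.no_grad():
        adv_actions = model.predict_actions_from_features(adv_feats)  # hypothetical API
    agreement = (clean_actions == adv_actions).float().mean().item()
    assert agreement >= AGREEMENT_THRESHOLD, (
        f"feature-space attack flipped too many actions: agreement={agreement:.2f}"
    )
```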

Limitations & Future Work

  • Scope of models – Experiments focus on a handful of popular VLA architectures; transferability to other multimodal setups (e.g., video‑language‑action) remains untested.
  • Physical‑world feasibility – While the perturbations are sparse, they are still digital; translating them into real‑world stickers or lighting changes is an open challenge.
  • Defense evaluation – The paper proposes the attack but does not benchmark existing defenses (e.g., adversarial training, feature denoising) against ADVLA.

Future research could explore universal (input‑agnostic) feature‑space perturbations, extend the method to video streams, and develop attention‑aware robustness metrics.

Authors

  • Naifu Zhang
  • Wei Tao
  • Xi Xiao
  • Qianpu Sun
  • Yuxin Zheng
  • Wentao Mo
  • Peiqiang Wang
  • Nan Zhang

Paper Information

  • arXiv ID: 2511.21663v1
  • Categories: cs.CV, cs.AI
  • Published: November 26, 2025