[Paper] Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

Published: November 26, 2025 at 01:37 PM EST
3 min read
Source: arXiv - 2511.21663v1

Overview

Vision‑Language‑Action (VLA) models are the backbone of many embodied AI systems, from household robots to AR assistants. This paper introduces ADVLA, a lightweight adversarial attack that perturbs the visual features after they have been projected into the language space, achieving near‑perfect disruption of downstream actions while keeping the visual changes tiny and highly localized.

Key Contributions

  • Feature‑space attack: Instead of modifying raw pixels, ADVLA injects perturbations directly into the visual encoder’s output that is fed to the language module.
  • Attention‑guided sparsity: Uses the model’s own attention maps to focus perturbations on the most influential patches, reducing the modified area to < 10 % of the image.
  • Three complementary strategies (a code sketch follows this list):
    1. Sensitivity amplification – boosts gradients on high‑impact features.
    2. Sparse masking (Top‑K) – limits perturbations to the top‑K attended patches.
    3. Concentration regularization – encourages the perturbation mass to cluster on critical regions.
  • Efficiency: A single‑step attack runs in ~0.06 s per image, roughly 7× faster than the conventional patch‑based baseline reported in the results table (0.4 s).
  • Strong empirical results: Under an $L_{\infty}=4/255$ budget, ADVLA attains ≈ 100 % attack success with barely perceptible changes.
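
The paper's exact formulation is not reproduced here, but a minimal PyTorch‑style sketch of the second and third strategies might look as follows; the tensor shapes, the source of the attention scores, and the weighting used in the concentration term are assumptions for illustration, not the authors' implementation.

```python
import torch

def topk_patch_mask(attn_scores: torch.Tensor, k_ratio: float = 0.1) -> torch.Tensor:
    """attn_scores: (num_patches,) cross-modal attention per image patch.
    Returns a {0, 1} mask that keeps only the top-k attended patches."""
    k = max(1, int(k_ratio * attn_scores.numel()))
    mask = torch.zeros_like(attn_scores)
    mask[attn_scores.topk(k).indices] = 1.0
    return mask

def concentration_penalty(delta: torch.Tensor, attn_scores: torch.Tensor) -> torch.Tensor:
    """delta: (num_patches, feat_dim) perturbation on the projected features.
    Penalizes perturbation energy on weakly attended patches, so minimizing
    this term pushes the perturbation mass toward the critical regions."""
    per_patch_energy = delta.pow(2).sum(dim=-1)               # (num_patches,)
    weights = 1.0 - attn_scores / (attn_scores.max() + 1e-8)  # low attention -> high weight
    return (weights * per_patch_energy).sum()
```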

Methodology

  1. Feature extraction – The visual encoder processes an input frame and produces a set of patch embeddings.
  2. Projection to language space – These embeddings are linearly projected into the textual feature space that the language model consumes.
  3. Gradient‑based perturbation – ADVLA computes the gradient of the downstream action loss w.r.t. the projected features.
  4. Attention guidance – The model’s cross‑modal attention scores identify which patches most influence the action prediction.
  5. Sparse masking – Only the top‑K patches (e.g., 5 %–10 % of all patches) are allowed to receive perturbations.
  6. Optimization – A single‑step (or few‑step) update applies the perturbation, clipped to the $L_{\infty}=4/255$ bound.

The whole pipeline bypasses any end‑to‑end retraining of the VLA model, making it a “plug‑and‑play” attack.
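
As a rough illustration of steps 3–6, here is a minimal single‑step sketch under stated assumptions: an FGSM‑style signed‑gradient update, a hypothetical `vla_language_head` / `action_loss` pair standing in for the frozen VLA components, and externally supplied per‑patch attention scores.

```python
import torch

EPS = 4 / 255  # L_inf budget reported in the paper

def advla_single_step(proj_feats, attn_scores, vla_language_head, action_loss,
                      targets, k_ratio=0.1):
    """proj_feats: (num_patches, feat_dim) visual features after projection into
    the language space; attn_scores: (num_patches,) cross-modal attention."""
    delta = torch.zeros_like(proj_feats, requires_grad=True)

    # Gradient of the downstream action loss w.r.t. the projected features
    # (the frozen language/action head is used only for the forward/backward pass).
    loss = action_loss(vla_language_head(proj_feats + delta), targets)
    loss.backward()

    # Attention-guided sparsity: perturb only the top-k attended patches.
    k = max(1, int(k_ratio * attn_scores.numel()))
    mask = torch.zeros_like(attn_scores)
    mask[attn_scores.topk(k).indices] = 1.0

    # Signed-gradient step, masked per patch; magnitude EPS keeps it inside the L_inf ball.
    step = EPS * delta.grad.sign() * mask.unsqueeze(-1)
    return (proj_feats + step).detach()
```

Because the signed step has magnitude exactly EPS on the selected patches and zero elsewhere, the $L_{\infty}=4/255$ constraint holds by construction; a few‑step variant would instead accumulate smaller steps and clamp the perturbation to $[-\epsilon, \epsilon]$ after each iteration.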

Results & Findings

| Metric | Baseline Patch Attack | ADVLA (Top‑K) |
| --- | --- | --- |
| Attack Success Rate | ~85 % | ≈ 100 % |
| Modified Patch Ratio | 30 %–40 % | < 10 % |
| Visual Distortion (PSNR, higher = less distortion) | 22 dB | > 30 dB (near‑imperceptible) |
| Runtime per Image | 0.4 s | 0.06 s |

  • Perturbations concentrate on semantically important regions (e.g., objects the robot is supposed to interact with).
  • Even with the strict $L_{\infty}=4/255$ constraint, the downstream policy’s action logits are flipped, demonstrating high sensitivity of VLA pipelines to feature‑space noise.
  • Ablation studies confirm that each of the three strategies (sensitivity, sparsity, concentration) contributes additively to the attack’s potency.

Practical Implications

  • Security testing for embodied AI – Developers can use ADVLA as a fast, low‑cost sanity check to evaluate the robustness of their VLA pipelines before deployment.
  • Design of defenses – The fact that tiny, sparse perturbations in feature space are enough to break the system suggests that future defenses should monitor attention‑weighted feature stability, not just pixel‑level anomalies.
  • Resource‑constrained environments – Because ADVLA runs in milliseconds on a single GPU, it can be integrated into continuous integration (CI) pipelines or on‑device testing suites (a hypothetical check is sketched after this list).
  • Insight for model architects – The attack highlights that the projection layer from vision to language is a critical vulnerability; adding stochasticity or regularization there could improve resilience.
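
As a concrete, purely hypothetical example of the CI use case above, a robustness gate could compare clean and attacked action predictions; `predict_actions`, `predict_actions_from_features`, and the agreement threshold below are illustrative placeholders, not part of the paper.

```python
import torch

AGREEMENT_THRESHOLD = 0.9  # illustrative gate, not a value from the paper

def test_feature_space_robustness(model, batch, attack_fn):
    """Fails the build if an ADVLA-style attack flips too many predicted actions."""
    with torch.no_grad():
        clean_actions = model.predict_actions(batch)  # hypothetical API
    adv_feats = attack_fn(model, batch)               # perturbed projected features
    with torch.no_grad():
        adv_actions = model.predict_actions_from_features(adv_feats)  # hypothetical API
    agreement = (clean_actions == adv_actions).float().mean().item()
    assert agreement >= AGREEMENT_THRESHOLD, (
        f"feature-space attack flipped too many actions: agreement={agreement:.2f}"
    )
```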

Limitations & Future Work

  • Scope of models – Experiments focus on a handful of popular VLA architectures; transferability to other multimodal setups (e.g., video‑language‑action) remains untested.
  • Physical‑world feasibility – While the perturbations are sparse, they are still digital; translating them into real‑world stickers or lighting changes is an open challenge.
  • Defense evaluation – The paper proposes the attack but does not benchmark existing defenses (e.g., adversarial training, feature denoising) against ADVLA.

Future research could explore universal (input‑agnostic) feature‑space perturbations, extend the method to video streams, and develop attention‑aware robustness metrics.

Authors

  • Naifu Zhang
  • Wei Tao
  • Xi Xiao
  • Qianpu Sun
  • Yuxin Zheng
  • Wentao Mo
  • Peiqiang Wang
  • Nan Zhang

Paper Information

  • arXiv ID: 2511.21663v1
  • Categories: cs.CV, cs.AI
  • Published: November 26, 2025