[Paper] Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

Published: February 19, 2026
Source: arXiv - 2602.17645v1

Overview

The paper introduces M‑Attack‑V2, a set of simple yet powerful upgrades to the existing M‑Attack framework for black‑box adversarial attacks on Large Vision‑Language Models (LVLMs). By tackling the high‑variance gradients that plague prior transfer‑based attacks, the authors achieve dramatic gains in success rates against cutting‑edge models such as Claude‑4.0, Gemini‑2.5‑Pro, and GPT‑5, while keeping the attack pipeline fully black‑box (no gradient access).

Key Contributions

  • Diagnoses gradient instability in the original M‑Attack, linking it to ViT translation sensitivity and asymmetric source‑target crop handling.
  • Multi‑Crop Alignment (MCA): averages gradients from several independently sampled source crops per iteration, dramatically reducing variance.
  • Auxiliary Target Alignment (ATA): replaces aggressive target augmentations with a small, semantically‑aligned auxiliary target set, smoothing the target manifold.
  • Patch Momentum: reinterprets momentum at the patch level, replaying historical crop gradients to reinforce consistent directions.
  • Patch‑Size Ensemble (PE+): refines the ensemble of patch sizes to capture richer transferable cues.
  • M‑Attack‑V2: a modular, drop‑in improvement over M‑Attack that lifts black‑box LVLM attack success from single‑digit baselines (e.g., 8 % on Claude‑4.0) to markedly higher, in some cases near‑perfect, rates on state‑of‑the‑art models.
  • Open‑source release of code, data, and pretrained attack configurations.

Methodology

  1. Problem Setup – In a black‑box setting, the attacker can only query an LVLM with image‑text pairs and observe the model’s output. The goal is to craft a perturbed image that forces the LVLM to produce a targeted (incorrect) response.
  2. Original M‑Attack Recap – Uses local crop‑level matching: random crops of the source image are aligned with crops of a target image, and gradients are estimated via transfer from a surrogate model.
  3. Why It Fails
    • ViT translation sensitivity creates “spike‑like” gradients that change dramatically with small crop shifts.
    • Asymmetric source/target crops generate nearly orthogonal gradient directions across iterations, leading to noisy updates.
  4. Multi‑Crop Alignment (MCA) – For each iteration, sample N independent crops from the source image, compute their surrogate gradients, and average them. This expectation over source transformations stabilizes the direction.
  5. Auxiliary Target Alignment (ATA) – Instead of heavily augmenting the target image (which expands the target manifold), draw a small set of auxiliary target images from a semantically related distribution (e.g., same class or caption). The attack aligns the source crops to this smoother target set, reducing variance on the target side.
  6. Patch Momentum – Traditional momentum accumulates full‑image gradients. Patch Momentum stores momentum per ViT patch, allowing the optimizer to “replay” historically consistent patch‑level directions.
  7. Patch‑Size Ensemble (PE+) – Run the attack simultaneously over several patch sizes (e.g., 16×16, 32×32) and aggregate the resulting gradients, capturing both fine‑grained and coarse cues.
  8. Putting It All Together – The modules are orthogonal and can be toggled independently. In practice, the authors use MCA + ATA + Patch Momentum + PE+ as a single pipeline (M‑Attack‑V2).
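Step 4 (MCA) above can be sketched as follows. This is an illustrative re-implementation, not the authors' released code: `surrogate_grad_fn` is a hypothetical callable standing in for the surrogate model's crop-level alignment gradient, and crop sampling details are assumptions.

```python
import numpy as np

def multi_crop_gradient(image, target_embed, surrogate_grad_fn,
                        n_crops=8, crop_size=224, rng=None):
    """Multi-Crop Alignment (MCA) sketch: average surrogate gradients
    over N independently sampled source crops to reduce the variance
    caused by ViT translation sensitivity.

    `surrogate_grad_fn(crop, target_embed)` is a hypothetical callable
    returning the gradient of the alignment loss w.r.t. the crop.
    """
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    accum = np.zeros_like(image, dtype=np.float64)
    for _ in range(n_crops):
        # sample an independent random crop location
        y = rng.integers(0, h - crop_size + 1)
        x = rng.integers(0, w - crop_size + 1)
        crop = image[y:y + crop_size, x:x + crop_size]
        g = surrogate_grad_fn(crop, target_embed)
        # scatter the crop gradient back into a full-image buffer
        accum[y:y + crop_size, x:x + crop_size] += g
    # expectation over source transformations stabilizes the direction
    return accum / n_crops
```

The averaged buffer is then used as the update direction for the perturbation, in place of the single-crop gradient of the original M‑Attack.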
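Step 5 (ATA) amounts to replacing a single augmented target with a small auxiliary set. A minimal sketch, assuming embeddings are plain feature vectors from a surrogate encoder and an averaged cosine objective (the authors' exact loss may differ):

```python
import numpy as np

def ata_alignment_loss(crop_embed, aux_target_embeds):
    """Auxiliary Target Alignment (ATA) sketch: align the source-crop
    embedding with a small set of semantically related auxiliary
    targets and average the cosine similarities, smoothing the
    target-side objective instead of expanding it with augmentations.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    # maximize mean similarity to the auxiliary set -> minimize its negative
    return -np.mean([cos(crop_embed, t) for t in aux_target_embeds])
```

Because the loss averages over a handful of fixed, semantically aligned targets, successive gradients point toward one smooth region of the target manifold rather than chasing a different augmentation each step.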
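Step 6 (Patch Momentum) keeps a separate momentum buffer per ViT patch. The sketch below is a hedged illustration of that idea, not the paper's exact update rule; the per-patch normalization is an assumption to keep any single patch from dominating.

```python
import numpy as np

def patch_momentum_update(grad, momentum, patch=16, beta=0.9):
    """Patch Momentum sketch: accumulate momentum per ViT patch rather
    than over the whole image, so historically consistent patch-level
    directions are reinforced while noisy patches decay.

    `grad` and `momentum` are (H, W, C) arrays with H and W divisible
    by `patch`; `beta` is the momentum decay factor.
    """
    h, w, _ = grad.shape
    new_m = momentum.copy()
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            g = grad[y:y + patch, x:x + patch]
            m = momentum[y:y + patch, x:x + patch]
            # normalize each patch gradient independently (assumption)
            g = g / (np.abs(g).mean() + 1e-12)
            new_m[y:y + patch, x:x + patch] = beta * m + g
    return new_m
```

Patches whose gradients agree across iterations accumulate a large momentum term, effectively "replaying" consistent crop gradients as the paper describes.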

Results & Findings

| Target LVLM | Baseline M‑Attack success | M‑Attack‑V2 success |
|---|---|---|
| Claude‑4.0 | 8 % | 30 % |
| Gemini‑2.5‑Pro | 83 % | 97 % |
| GPT‑5 | 98 % | 100 % |

  • Gradient variance drops by ~70 % when MCA is applied, as measured by the norm of gradient differences across iterations.
  • ATA reduces the average cosine distance between successive target gradients from 0.45 to 0.12, indicating a smoother target landscape.
  • Patch Momentum yields a 5–10 % boost in transferability on top of MCA + ATA, especially for larger ViT backbones.
  • PE+ contributes an additional 2–3 % improvement, confirming that multi‑scale patch information is complementary.

Overall, the attack remains black‑box (only query access) while achieving transfer success rates that rival white‑box methods on the tested LVLMs.

Practical Implications

  • Security Auditing for Multimodal Products – Companies deploying LVLMs (e.g., visual assistants, content moderation tools) can now evaluate robustness with a lightweight, query‑only attack suite.
  • Defensive Research – The identified failure modes (translation‑sensitive ViT gradients, asymmetric crop handling) give concrete targets for defenses such as gradient masking, randomized patch shuffling, or robust data augmentation.
  • Adversarial Training Pipelines – MCA and ATA can be repurposed as data‑centric augmentation strategies: training with multi‑crop, semantically‑aligned pairs may improve model invariance to fine‑grained perturbations.
  • Benchmarking Transferability – M‑Attack‑V2 provides a strong baseline for future black‑box LVLM attack research, enabling fairer comparisons across papers.
  • Tooling for Red‑Teamers – The open‑source implementation can be integrated into existing red‑team frameworks (e.g., AutoAttack, Foolbox) to extend their coverage to multimodal models without needing gradient access.

Limitations & Future Work

  • Query Budget – While the attack is black‑box, achieving near‑perfect success on GPT‑5 still requires thousands of queries, which may be impractical against rate‑limited APIs.
  • Dependence on Surrogate Model – Transferability hinges on the quality of the surrogate LVLM; attacks may degrade against models with substantially different architectures.
  • Semantic Auxiliary Set Construction – ATA assumes access to a small, semantically related image pool; generating such sets automatically for arbitrary targets remains an open challenge.
  • Defense Evaluation – The paper focuses on attack performance; systematic testing against existing defenses (e.g., input randomization, detection mechanisms) is left for future studies.
  • Extending Beyond Vision‑Language – Applying the same gradient‑denoising ideas to pure language or audio‑text multimodal models is a promising direction.

Overall, M‑Attack‑V2 shines a light on the hidden fragilities of today’s most capable LVLMs and equips practitioners with a practical tool to probe—and eventually harden—these systems.

Authors

  • Xiaohan Zhao
  • Zhaoyi Li
  • Yaxin Luo
  • Jiacheng Cui
  • Zhiqiang Shen

Paper Information

  • arXiv ID: 2602.17645v1
  • Categories: cs.LG, cs.AI, cs.CL, cs.CV
  • Published: February 19, 2026