[Paper] Adversarial Robustness of Vision in Open Foundation Models
Source: arXiv - 2512.17902v1
Overview
The paper Adversarial Robustness of Vision in Open Foundation Models examines how two popular open‑weight vision‑language models—LLaVA‑1.5‑13B and Meta’s Llama 3.2 Vision‑8B‑2—behave when their visual inputs are deliberately corrupted. By running untargeted Projected Gradient Descent (PGD) attacks on images from the VQA‑v2 benchmark, the authors quantify how much the models’ answer accuracy drops, revealing surprising differences in robustness that are not obvious from standard performance scores alone.
Key Contributions
- First systematic adversarial evaluation of open‑weight vision‑language models (VLMs) on a large‑scale VQA benchmark.
- Empirical comparison of two state‑of‑the‑art VLMs (LLaVA‑1.5‑13B vs. Llama 3.2 Vision‑8B‑2) under increasing PGD perturbation strengths.
- Discovery that higher baseline accuracy does not guarantee stronger adversarial robustness: Llama 3.2 Vision, despite a lower clean score, degrades more gracefully under attack.
- Quantitative analysis linking robustness to architectural and training choices, suggesting that model size, multimodal fusion strategy, and pre‑training data affect susceptibility.
- Open‑source release of attack scripts and perturbed VQA subsets, enabling the community to benchmark future VLMs against visual adversaries.
Methodology
- Models Tested – LLaVA‑1.5‑13B (a CLIP‑backbone + LLM fusion) and Meta’s Llama 3.2 Vision‑8B‑2 (a unified transformer with early visual token integration).
- Dataset – A curated subset of the VQA‑v2 dataset (≈10 k image‑question pairs) that covers a balanced mix of object, attribute, and counting questions.
- Attack Procedure – Untargeted PGD is applied directly to the raw pixel values of each image. The attack runs for 40 iterations under ℓ∞ perturbation budgets of 2/255, 4/255, 8/255, and 16/255, with the step size tuned to each budget. No gradient information from the language component is used; only the visual encoder’s loss is back‑propagated (a minimal PGD sketch follows this list).
- Evaluation Metric – Standard VQA accuracy (the proportion of answers that match the human‑provided ground truth after majority voting). Accuracy is reported for clean images and for each perturbation level; the metric and accuracy‑drop computation are sketched after this list.
- Analysis – The authors compute accuracy drop (clean – adversarial) and plot robustness curves, then correlate these with model architecture details (e.g., depth of visual encoder, token‑level fusion).
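The attack described above follows the standard untargeted ℓ∞ PGD recipe. Below is a minimal sketch assuming a PyTorch‑style `vision_encoder` that maps a `[0, 1]` image tensor to visual features; the authors’ exact loss and implementation details may differ, and the feature‑distance objective shown here is one common way to attack only the visual encoder.

```python
import torch
import torch.nn.functional as F

def pgd_untargeted(vision_encoder, image, eps, steps=40, step_size=None):
    """Untargeted l_inf PGD on raw pixel values (illustrative sketch).

    vision_encoder -- assumed callable mapping a [0, 1] image tensor
                      of shape (1, 3, H, W) to a feature tensor
    image          -- clean input image, values in [0, 1]
    eps            -- l_inf budget (e.g. 2/255, 4/255, 8/255, 16/255)
    steps          -- PGD iterations (the paper reports 40)
    step_size      -- per-step magnitude; 2.5 * eps / steps is a common default
    """
    if step_size is None:
        step_size = 2.5 * eps / steps

    clean_features = vision_encoder(image).detach()  # reference features
    adv = image.clone().detach()

    for _ in range(steps):
        adv.requires_grad_(True)
        # Untargeted objective: push the adversarial features away from the
        # clean features; no gradients flow through the language model.
        loss = F.mse_loss(vision_encoder(adv), clean_features)
        grad = torch.autograd.grad(loss, adv)[0]

        with torch.no_grad():
            adv = adv + step_size * grad.sign()           # gradient ascent step
            adv = image + (adv - image).clamp(-eps, eps)  # project onto the eps-ball
            adv = adv.clamp(0.0, 1.0)                     # stay in valid pixel range

    return adv.detach()
```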
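For the metric, the standard VQA‑v2 formula gives full credit once at least three of the ten human annotators agree with the predicted answer; the sketch below implements that consensus rule together with the accuracy‑drop quantity used in the robustness curves. It is an illustration of the standard metric, not code released with the paper.

```python
def vqa_accuracy(predicted, human_answers):
    """Standard VQA-v2 consensus accuracy for one question.

    predicted     -- the model's answer string (assumed already normalised)
    human_answers -- the list of 10 human annotations for the question
    """
    matches = sum(answer == predicted for answer in human_answers)
    return min(matches / 3.0, 1.0)  # full credit once >= 3 annotators agree


def accuracy_drop(clean_acc, adversarial_acc):
    """Accuracy drop in percentage points, as plotted in the robustness curves."""
    return clean_acc - adversarial_acc


# Example with the reported LLaVA-1.5-13B numbers at eps = 8/255:
assert round(accuracy_drop(71.2, 28.7), 1) == 42.5
```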
Results & Findings
| Perturbation (ℓ∞) | LLaVA‑1.5‑13B Acc. (Δ vs. clean) | Llama 3.2 Vision‑8B‑2 Acc. (Δ vs. clean) |
|---|---|---|
| 0 (clean) | 71.2 % | 64.8 % |
| 2/255 | 58.9 % (−12.3) | 55.6 % (−9.2) |
| 4/255 | 45.3 % (−25.9) | 48.9 % (−15.9) |
| 8/255 | 28.7 % (−42.5) | 36.2 % (−28.6) |
| 16/255 | 12.4 % (−58.8) | 21.5 % (−43.3) |
Key Takeaways
- Both models suffer dramatic accuracy loss as perturbation strength grows, confirming that the visual channel is a viable attack surface.
- Llama 3.2 Vision consistently loses less accuracy than LLaVA at every perturbation level, despite starting from a lower clean baseline.
- The relative robustness gap widens at higher ε, suggesting that Llama 3.2’s early visual‑token integration may provide implicit regularization against pixel‑level noise.
- No simple linear relationship exists between clean performance and robustness; architectural choices (e.g., depth of visual encoder, token‑fusion timing) appear to matter more.
Practical Implications
- Security‑by‑Design for Multimodal Apps – Developers building chat‑bots, image‑search, or assistive tools that rely on VLMs should treat the visual front‑end as a potential attack vector. Simple image preprocessing (e.g., JPEG compression, denoising) could mitigate low‑budget PGD attacks; a minimal preprocessing sketch follows this list.
- Model Selection – When robustness matters more than raw VQA accuracy (e.g., in safety‑critical inspection or medical imaging), Llama 3.2 Vision may be a better default despite its lower clean score.
- Adversarial Testing Pipelines – The released PGD scripts can be integrated into CI pipelines to automatically flag regressions in visual robustness as new model versions are fine‑tuned or quantized.
- Guidance for Fine‑Tuning – The findings suggest that fine‑tuning on noisy or augmented visual data could improve robustness without sacrificing much accuracy, a practical recipe for teams that already have a VLM in production.
- Regulatory & Compliance – For industries where AI explainability and reliability are mandated (e.g., autonomous driving), demonstrating resistance to visual adversaries becomes part of the compliance checklist.
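As a concrete example of the input preprocessing mentioned in the first bullet, JPEG re‑compression is a cheap way to blunt small ℓ∞ perturbations. The sketch below uses Pillow; it is an illustration only, was not evaluated in the paper, and the `quality` default is a hypothetical choice.

```python
from io import BytesIO

from PIL import Image


def jpeg_recompress(image: Image.Image, quality: int = 75) -> Image.Image:
    """Re-encode an image as JPEG to blunt small pixel-level perturbations.

    A generic input-preprocessing defence, not one evaluated in the paper;
    `quality` trades robustness against visual fidelity.
    """
    buffer = BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer).copy()


# Usage: run every query image through the filter before the vision encoder.
# defended = jpeg_recompress(Image.open("query.png"))
```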
Limitations & Future Work
- Untargeted PGD only – The study focuses on untargeted attacks; targeted or perceptually‑constrained attacks (e.g., patch‑based, style‑transfer) could behave differently.
- Single Dataset – Results are reported on a VQA‑v2 subset; other vision‑language tasks (image captioning, visual grounding) may exhibit distinct robustness patterns.
- No Defense Evaluation – The paper does not test common defenses (adversarial training, input preprocessing), leaving open the question of how much robustness can be regained.
- Architectural Attribution – While the authors hypothesize that early token fusion helps, a deeper ablation (varying fusion depth, encoder size) is needed to pinpoint causal factors.
- Scalability – Experiments are limited to 13B and 8B models; it remains unclear whether larger foundation models (e.g., 70B) follow the same trend.
Future directions could include targeted attacks, cross‑task robustness studies, systematic defense benchmarking, and a deeper architectural ablation to guide the next generation of robust vision‑language foundations.
Authors
- Jonathon Fox
- William J Buchanan
- Pavlos Papadopoulos
Paper Information
- arXiv ID: 2512.17902v1
- Categories: cs.CV, cs.AI, cs.CR
- Published: December 19, 2025