[Paper] Visually Prompted Benchmarks Are Surprisingly Fragile

Published: December 19, 2025 at 01:26 PM EST
4 min read
Source: arXiv - 2512.17875v1

Overview

The paper uncovers a surprising weakness in today's vision-language models (VLMs): when a benchmark relies on visual prompting, i.e., small markers (e.g., colored boxes) that tell the model where to look, seemingly irrelevant changes to those markers can dramatically reshuffle model rankings. By systematically tweaking marker color, size, and even JPEG compression quality, the authors show that benchmark results can be gamed, calling into question the reliability of many current VLM leaderboards.

Key Contributions

  • Empirical fragility analysis of nine popular open‑ and closed‑source VLMs on two visual‑prompting tasks.
  • Demonstration of “benchmark hacking”: simple visual marker tweaks (color, size) can promote weaker models (e.g., InternVL‑3‑8B) above far larger proprietary systems.
  • Identification of low‑level inference factors (JPEG compression, API image preprocessing) that disproportionately affect visually prompted benchmarks.
  • Creation of VPBench, a curated, larger benchmark containing 16 marker variants and accompanying analysis tools to reduce instability.
  • Open‑source release of the dataset and evaluation scripts (https://lisadunlap.github.io/vpbench/), enabling reproducible and more robust VLM testing.

Methodology

  1. Benchmark selection – The authors reused two existing visual‑prompting datasets (e.g., BLINK) where each question is paired with a colored marker placed on the image.
  2. Model suite – Nine VLMs were evaluated, spanning open‑source (InternVL‑3‑8B, LLaVA, etc.) and closed‑source commercial APIs (Gemini 2.5 Pro, GPT‑4V, etc.).
  3. Prompt perturbations – For each image, the visual marker was systematically altered across several dimensions (a minimal code sketch follows this list):
    • Color (red → blue, green, etc.)
    • Size (tiny → slightly larger)
    • Opacity / border style
    • Compression (different JPEG quality levels)
  4. Evaluation pipeline – The same textual question was sent to each model with the altered image; answers were scored using the original ground‑truth labels.
  5. Statistical analysis – Rankings, mean accuracy, and variance were computed for each perturbation to quantify sensitivity.
  6. Benchmark redesign – Based on observed sensitivities, the authors aggregated all marker variants into a single, larger benchmark (VPBench) and provided scripts to compute robust scores (e.g., averaging across variants).
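
As a concrete illustration of step 3, the sketch below shows one way such perturbations could be generated, assuming a simple rectangular marker drawn with Pillow and a JPEG re-encode to mimic API-side preprocessing. The color/width/quality grids and helper names are illustrative assumptions, not the authors' released pipeline.

```python
from io import BytesIO
from itertools import product

from PIL import Image, ImageDraw  # pip install Pillow

MARKER_COLORS = ["red", "blue", "green"]   # assumed color set
MARKER_WIDTHS = [3, 6]                     # outline width in px: "tiny" vs. slightly larger
JPEG_QUALITIES = [100, 85, 70]             # re-encoding qualities


def draw_marker(image: Image.Image, box: tuple, color: str, width: int) -> Image.Image:
    """Return a copy of `image` with a rectangular visual-prompt marker drawn on it."""
    out = image.copy()
    ImageDraw.Draw(out).rectangle(box, outline=color, width=width)
    return out


def jpeg_roundtrip(image: Image.Image, quality: int) -> Image.Image:
    """Simulate API-side preprocessing by re-encoding the image as JPEG."""
    buf = BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)


def marker_variants(image: Image.Image, box):
    """Yield (variant_id, perturbed_image) pairs, one per perturbation combination."""
    for color, width, quality in product(MARKER_COLORS, MARKER_WIDTHS, JPEG_QUALITIES):
        marked = draw_marker(image, box, color, width)
        yield f"{color}-w{width}-q{quality}", jpeg_roundtrip(marked, quality)
```

Each (color, width, quality) combination produces one variant of the same question image, which can then be fed through the unchanged evaluation pipeline of step 4.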

Results & Findings

  • Marker color – Switching from red to blue caused up to a 30% accuracy drop for some models while others improved, reshuffling the leaderboard.
  • Marker size – Slightly enlarging the marker (by ~10 px) lifted the open-source InternVL-3-8B to parity with Gemini 2.5 Pro on the original benchmark.
  • JPEG compression – Varying compression from quality 100 to 70 altered rankings for 5 of the 9 models, even though the visual content remained semantically identical.
  • Overall variance – Across all perturbations, the standard deviation of model scores was 2–3× higher than on conventional (non-prompted) VLM benchmarks.
  • VPBench impact – When evaluated on the aggregated 16-variant VPBench, variance dropped by ≈45% and rankings became more stable across perturbations.

The key takeaway is that visual prompting introduces a hidden “visual prior” that models latch onto, making them vulnerable to low‑level visual cues that are irrelevant to the actual reasoning task.

Practical Implications

  • Benchmark design: Teams building VLM evaluation suites should avoid single‑variant visual prompts; instead, they should randomize marker attributes or use multiple variants (as VPBench does).
  • Model debugging: Developers can use the provided analysis tools to diagnose whether a model is over‑fitting to marker color/size rather than truly understanding the image content.
  • API usage: When calling commercial VLM APIs, be aware that image preprocessing (e.g., automatic JPEG compression) can unintentionally bias results—consider sending lossless formats or controlling compression levels.
  • Product reliability: Applications that rely on VLMs for visual QA (e.g., document analysis, medical imaging assistants) should not assume robustness to minor visual artifacts; thorough testing with varied prompts is essential.
  • Fair competition: Leaderboards that rank VLMs should disclose visual prompt specifications and possibly report robustness scores (performance averaged over multiple marker styles; see the sketch below).
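
A robustness score of this kind is straightforward to compute once per-variant accuracies are available. Below is a minimal sketch, assuming a model → variant → accuracy mapping; the model names and numbers are purely illustrative, not results from the paper.

```python
from statistics import mean, stdev

# Per-model accuracy on each marker variant (illustrative placeholder values).
scores = {
    "model-a": {"red-small": 0.71, "blue-small": 0.52, "red-large": 0.69},
    "model-b": {"red-small": 0.64, "blue-small": 0.63, "red-large": 0.65},
}

for model, by_variant in scores.items():
    accs = list(by_variant.values())
    robust = mean(accs)                             # robustness score: mean over variants
    spread = stdev(accs) if len(accs) > 1 else 0.0  # large spread = fragile to marker style
    print(f"{model}: robust accuracy {robust:.3f} (±{spread:.3f} across {len(accs)} variants)")
```

Reporting the spread alongside the mean also supports the model-debugging point above: a large spread across marker styles suggests the model is reacting to the marker itself rather than to the image content.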

Limitations & Future Work

  • Scope of tasks: The study focuses on two visual‑prompting tasks; broader task families (e.g., video QA, multimodal reasoning) may exhibit different sensitivities.
  • Model diversity: While nine models were tested, the rapidly expanding VLM ecosystem means newer architectures could behave differently.
  • Human perception baseline: The paper does not compare model fragility to human performance on the same perturbed prompts, leaving open the question of whether the observed effects are uniquely machine‑centric.
  • Mitigation strategies: VPBench reduces variance but does not eliminate it; future work could explore training‑time regularization (e.g., marker‑agnostic data augmentation) to make models inherently robust.

By highlighting these gaps, the authors invite the community to develop more stable evaluation practices and to design VLMs that truly “see” beyond superficial visual cues.

Authors

  • Haiwen Feng
  • Long Lian
  • Lisa Dunlap
  • Jiahao Shu
  • XuDong Wang
  • Renhao Wang
  • Trevor Darrell
  • Alane Suhr
  • Angjoo Kanazawa

Paper Information

  • arXiv ID: 2512.17875v1
  • Categories: cs.CV, cs.LG
  • Published: December 19, 2025