[Paper] Visually Prompted Benchmarks Are Surprisingly Fragile

Published: December 19, 2025 at 01:26 PM EST
4 min read
Source: arXiv - 2512.17875v1

Overview

The paper uncovers a surprising weakness in today's vision-language models (VLMs): when a benchmark relies on visual prompting, i.e., small markers (e.g., colored boxes) that tell the model where to look, seemingly irrelevant changes to those markers can dramatically reshuffle model rankings. By systematically tweaking marker color, size, and even JPEG compression quality, the authors show that benchmark results can be gamed, calling into question the reliability of many current VLM leaderboards.

Key Contributions

  • Empirical fragility analysis of nine popular open‑ and closed‑source VLMs on two visual‑prompting tasks.
  • Demonstration of “benchmark hacking”: simple visual marker tweaks (color, size) can promote weaker models (e.g., InternVL‑3‑8B) above far larger proprietary systems.
  • Identification of low‑level inference factors (JPEG compression, API image preprocessing) that disproportionately affect visually prompted benchmarks.
  • Creation of VPBench, a curated, larger benchmark containing 16 marker variants and accompanying analysis tools to reduce instability.
  • Open‑source release of the dataset and evaluation scripts (https://lisadunlap.github.io/vpbench/), enabling reproducible and more robust VLM testing.

Methodology

  1. Benchmark selection – The authors reused two existing visual‑prompting datasets (e.g., BLINK) where each question is paired with a colored marker placed on the image.
  2. Model suite – Nine VLMs were evaluated, spanning open‑source (InternVL‑3‑8B, LLaVA, etc.) and closed‑source commercial APIs (Gemini 2.5 Pro, GPT‑4V, etc.).
  3. Prompt perturbations – For each image, the visual marker was systematically altered across several dimensions (a minimal code sketch follows this list):
    • Color (red → blue, green, etc.)
    • Size (tiny → slightly larger)
    • Opacity / border style
    • Compression (different JPEG quality levels)
  4. Evaluation pipeline – The same textual question was sent to each model with the altered image; answers were scored using the original ground‑truth labels.
  5. Statistical analysis – Rankings, mean accuracy, and variance were computed for each perturbation to quantify sensitivity.
  6. Benchmark redesign – Based on observed sensitivities, the authors aggregated all marker variants into a single, larger benchmark (VPBench) and provided scripts to compute robust scores (e.g., averaging across variants).
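
As a concrete illustration of step 3, the sketch below shows one way such perturbations could be generated, assuming a simple rectangular marker drawn with Pillow and a JPEG re-encode to mimic API-side preprocessing. The color/width/quality grids and helper names are illustrative assumptions, not the authors' released pipeline.

```python
from io import BytesIO
from itertools import product

from PIL import Image, ImageDraw  # pip install Pillow

MARKER_COLORS = ["red", "blue", "green"]   # assumed color set
MARKER_WIDTHS = [3, 6]                     # outline width in px: "tiny" vs. slightly larger
JPEG_QUALITIES = [100, 85, 70]             # re-encoding qualities


def draw_marker(image: Image.Image, box: tuple, color: str, width: int) -> Image.Image:
    """Return a copy of `image` with a rectangular visual-prompt marker drawn on it."""
    out = image.copy()
    ImageDraw.Draw(out).rectangle(box, outline=color, width=width)
    return out


def jpeg_roundtrip(image: Image.Image, quality: int) -> Image.Image:
    """Simulate API-side preprocessing by re-encoding the image as JPEG."""
    buf = BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)


def marker_variants(image: Image.Image, box):
    """Yield (variant_id, perturbed_image) pairs, one per perturbation combination."""
    for color, width, quality in product(MARKER_COLORS, MARKER_WIDTHS, JPEG_QUALITIES):
        marked = draw_marker(image, box, color, width)
        yield f"{color}-w{width}-q{quality}", jpeg_roundtrip(marked, quality)
```

Each (color, width, quality) combination produces one variant of the same question image, which can then be fed through the unchanged evaluation pipeline of step 4.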

Results & Findings

  • Marker color – Switching from red to blue caused up to a 30% accuracy drop for some models while others improved, reshuffling the leaderboard.
  • Marker size – Slightly enlarging the marker (by ~10 px) lifted the open-source InternVL-3-8B to parity with Gemini 2.5 Pro on the original benchmark.
  • JPEG compression – Varying compression from quality 100 to 70 altered rankings for 5 of the 9 models, even though the visual content remained semantically identical.
  • Overall variance – Across all perturbations, the standard deviation of model scores was 2–3× higher than on conventional (non-prompted) VLM benchmarks.
  • VPBench impact – When evaluated on the aggregated 16-variant VPBench, variance dropped by ≈45% and rankings became more stable across perturbations.

The key takeaway is that visual prompting introduces a hidden “visual prior” that models latch onto, making them vulnerable to low‑level visual cues that are irrelevant to the actual reasoning task.

Practical Implications

  • Benchmark design: Teams building VLM evaluation suites should avoid single‑variant visual prompts; instead, they should randomize marker attributes or use multiple variants (as VPBench does).
  • Model debugging: Developers can use the provided analysis tools to diagnose whether a model is over‑fitting to marker color/size rather than truly understanding the image content.
  • API usage: When calling commercial VLM APIs, be aware that image preprocessing (e.g., automatic JPEG compression) can unintentionally bias results—consider sending lossless formats or controlling compression levels.
  • Product reliability: Applications that rely on VLMs for visual QA (e.g., document analysis, medical imaging assistants) should not assume robustness to minor visual artifacts; thorough testing with varied prompts is essential.
  • Fair competition: Leaderboards that rank VLMs should disclose visual prompt specifications and possibly report robustness scores (performance averaged over multiple marker styles; see the sketch below).
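
A robustness score of this kind is straightforward to compute once per-variant accuracies are available. Below is a minimal sketch, assuming a model → variant → accuracy mapping; the model names and numbers are purely illustrative, not results from the paper.

```python
from statistics import mean, stdev

# Per-model accuracy on each marker variant (illustrative placeholder values).
scores = {
    "model-a": {"red-small": 0.71, "blue-small": 0.52, "red-large": 0.69},
    "model-b": {"red-small": 0.64, "blue-small": 0.63, "red-large": 0.65},
}

for model, by_variant in scores.items():
    accs = list(by_variant.values())
    robust = mean(accs)                             # robustness score: mean over variants
    spread = stdev(accs) if len(accs) > 1 else 0.0  # large spread = fragile to marker style
    print(f"{model}: robust accuracy {robust:.3f} (±{spread:.3f} across {len(accs)} variants)")
```

Reporting the spread alongside the mean also supports the model-debugging point above: a large spread across marker styles suggests the model is reacting to the marker itself rather than to the image content.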

Limitations & Future Work

  • Scope of tasks: The study focuses on two visual‑prompting tasks; broader task families (e.g., video QA, multimodal reasoning) may exhibit different sensitivities.
  • Model diversity: While nine models were tested, the rapidly expanding VLM ecosystem means newer architectures could behave differently.
  • Human perception baseline: The paper does not compare model fragility to human performance on the same perturbed prompts, leaving open the question of whether the observed effects are uniquely machine‑centric.
  • Mitigation strategies: VPBench reduces variance but does not eliminate it; future work could explore training‑time regularization (e.g., marker‑agnostic data augmentation) to make models inherently robust.

By highlighting these gaps, the authors invite the community to develop more stable evaluation practices and to design VLMs that truly “see” beyond superficial visual cues.

Authors

  • Haiwen Feng
  • Long Lian
  • Lisa Dunlap
  • Jiahao Shu
  • XuDong Wang
  • Renhao Wang
  • Trevor Darrell
  • Alane Suhr
  • Angjoo Kanazawa

Paper Information

  • arXiv ID: 2512.17875v1
  • Categories: cs.CV, cs.LG
  • Published: December 19, 2025