[Paper] DatBench: Discriminative, Faithful, and Efficient VLM Evaluations
Source: arXiv - 2601.02316v1
Overview
The paper “DatBench: Discriminative, Faithful, and Efficient VLM Evaluations” tackles a surprisingly sticky problem in the booming field of vision‑language models (VLMs): how do we reliably measure how good these models really are? The authors argue that many popular benchmarks are misleading, wasteful, or both, and they introduce a revamped evaluation suite, DatBench, that is more faithful to real‑world use, better at separating models of different quality, and dramatically cheaper to run.
Key Contributions
- Three evaluation desiderata – faithfulness, discriminability, and efficiency – proposed as the criteria a VLM benchmark should satisfy.
- Systematic audit of existing VLM tests, exposing three major failure modes: (i) multiple‑choice formats that encourage guessing, (ii) “blindly solvable” items that don’t need the image, and (iii) mislabeled or ambiguous samples.
- Data‑centric remediation pipeline that (a) converts multiple‑choice questions into generative prompts, (b) filters out blind‑solve and noisy examples, (c) curates a clean, high‑quality subset.
- DatBench‑Full – a comprehensive suite of 33 datasets covering nine VLM capabilities (e.g., object grounding, visual reasoning, captioning).
- DatBench (compact) – a distilled, discriminative subset that delivers up to 50× speed‑up (average 13×) while preserving the ability to separate models of different quality.
- Empirical evidence that the cleaned benchmarks reveal capability gaps of up to 35 % that were hidden in the original tests.
Methodology
- Failure‑Mode Diagnosis – Quantified how many items in popular VLM benchmarks could be answered correctly without looking at the image (up to 70 % in some cases) and measured label noise (up to 42 %).
- Transformation – Re‑phrased multiple‑choice questions as open‑ended generation tasks (e.g., “What is shown in the image?”) so models can’t rely on answer‑option elimination.
- Filtering – Used a lightweight “blind‑solver” (a language‑only model) to flag and remove items solvable without visual input; human verification then caught ambiguous or mislabeled cases. A minimal sketch of the transformation and filtering steps appears after this list.
- Benchmark Assembly – Grouped cleaned items into nine capability buckets (e.g., VQA, visual entailment, region grounding). Released two versions: a full, exhaustive set and a compact, high‑discriminability subset selected via greedy optimization for maximal model separation per compute unit (a simplified selection sketch follows this list).
- Evaluation Protocol – Ran standard VLMs (e.g., CLIP‑based, Flamingo, LLaVA) on both the original and DatBench versions, recording performance drops, compute time, and discriminative scores (e.g., pairwise rank correlation).
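A minimal sketch of the transformation and filtering steps, in Python. Everything here is hypothetical rather than the authors' released pipeline: `Item`, `to_generative`, `filter_blind_solvable`, and `dummy_solver` are made-up names, and in practice the blind solver would be a call to a language‑only model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Item:
    question: str
    choices: Optional[List[str]]  # None once converted to open-ended
    answer: str
    image_path: str

def to_generative(item: Item) -> Item:
    """Drop the answer options so a model must generate the answer
    instead of eliminating choices (the MCQ -> generative conversion)."""
    return Item(item.question, None, item.answer, item.image_path)

def filter_blind_solvable(items: List[Item],
                          blind_solver: Callable[[str], str]) -> List[Item]:
    """Keep only items a language-only model cannot answer from the
    question text alone; `blind_solver` returns its best guess given
    just the question string (no image)."""
    return [it for it in items
            if blind_solver(it.question).strip().lower()
            != it.answer.strip().lower()]

def dummy_solver(question: str) -> str:
    """Stand-in for a language-only LLM call; purely illustrative."""
    return "blue" if "sky" in question.lower() else ""

items = [
    Item("What color is the sky?", ["blue", "red"], "blue", "img1.jpg"),
    Item("What is the person holding?", ["cup", "phone"], "cup", "img2.jpg"),
]
cleaned = filter_blind_solvable([to_generative(it) for it in items], dummy_solver)
blind_solve_rate = 1 - len(cleaned) / len(items)  # the diagnosis statistic
# Only the second item survives: the first is answerable without the image.
```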
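The compact-subset construction can likewise be sketched as greedy selection. The snippet below is a simplified stand-in for the paper's optimization, under assumed inputs: given per-item scores from a pool of reference models, it repeatedly adds the item that best preserves the full-benchmark model ranking (measured by Spearman rank correlation) until a fixed budget is reached; the authors' actual objective and constraints may differ.

```python
import numpy as np
from scipy.stats import spearmanr

def greedy_discriminative_subset(scores: np.ndarray, budget: int) -> list:
    """scores: (num_models, num_items) matrix of per-item correctness for a
    pool of reference models. Greedily add the item whose inclusion best
    preserves the full-benchmark model ranking, up to `budget` items."""
    full_means = scores.mean(axis=1)  # full-benchmark score per model
    selected, remaining = [], set(range(scores.shape[1]))
    while len(selected) < budget and remaining:
        best_item, best_rho = None, -np.inf
        for j in remaining:
            subset_means = scores[:, selected + [j]].mean(axis=1)
            rho, _ = spearmanr(subset_means, full_means)
            rho = -np.inf if np.isnan(rho) else rho
            if best_item is None or rho > best_rho:
                best_item, best_rho = j, rho
        selected.append(best_item)
        remaining.remove(best_item)
    return selected

# Usage: 5 reference models x 200 items, keep a 20-item compact subset.
rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(5, 200)).astype(float)
subset = greedy_discriminative_subset(scores, budget=20)
rho, _ = spearmanr(scores[:, subset].mean(axis=1), scores.mean(axis=1))
print(f"Spearman rank correlation, compact vs. full: {rho:.2f}")
```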
Results & Findings
| Aspect | Original Benchmarks | DatBench‑Full | DatBench (compact) |
|---|---|---|---|
| Average accuracy drop (after conversion to generative) | – | ‑35 % (max) | ‑30 % (typical) |
| Blind‑solve rate | Up to 70 % | < 5 % | < 5 % |
| Label‑noise rate | Up to 42 % | < 2 % | < 2 % |
| Compute cost (GPU‑hours per model) | 1× (baseline) | 1× (same) | 0.07× (≈13× faster) |
| Discriminability (Spearman rank correlation across models) | 0.62 | 0.78 | 0.75 |
What this means: When the same VLMs are evaluated on the cleaned, generative version, their scores fall sharply, exposing hidden weaknesses. At the same time, the compact DatBench keeps almost the same ordering of models while slashing evaluation time dramatically.
Practical Implications
- R&D pipelines become leaner – Teams can now run a full VLM evaluation suite in a fraction of the time, freeing up compute for model training and iteration.
- More trustworthy model selection – Benchmarks faithful to real‑world tasks (no guessing, no image‑free shortcuts) let product engineers trust that a high score translates to downstream performance (e.g., e‑commerce visual search or AI‑assisted design tools).
- Benchmark‑driven product roadmaps – The nine capability categories map cleanly to common application domains (captioning, visual QA, grounding). Companies can prioritize improvements where DatBench shows the biggest gaps.
- Open‑source community standard – By releasing both the full and compact versions, the authors provide a drop‑in replacement for widely used VLM testbeds, encouraging reproducibility and fair competition.
- Cost savings at scale – For large labs evaluating dozens of model variants, a 13× speed‑up translates to millions of dollars saved annually in GPU compute.
Limitations & Future Work
- Scope of modalities – DatBench focuses on static images paired with text; video‑language or multimodal audio‑visual tasks are not covered.
- Human verification bottleneck – While the blind‑solver filter is automated, cleaning ambiguous labels still requires manual effort, which may not scale to new datasets without additional tooling.
- Generative evaluation metrics – Converting to open‑ended generation makes scoring depend on text‑overlap metrics (e.g., BLEU, ROUGE) that can be noisy; more robust similarity measures (e.g., CLIPScore) could be explored. The toy scorer after this list illustrates the problem.
- Dynamic benchmark evolution – As VLMs become capable of reasoning beyond the curated datasets, future work should investigate adversarial or out‑of‑distribution test cases to keep the evaluation challenging.
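To make the metric concern concrete, here is a plain token-overlap F1 scorer (a common open-ended QA heuristic, not the paper's metric): a correct paraphrase can score near zero simply because it shares few tokens with the reference answer.

```python
import re
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and the reference.
    Lowercase word tokenization only; no synonym or paraphrase handling."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("a man riding a bicycle", "person on a bike"))  # ~0.22, though a fair paraphrase
print(token_f1("person on a bike", "person on a bike"))        # 1.0
```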
Bottom line: DatBench offers a pragmatic, data‑centric answer to the growing pains of VLM evaluation, delivering clearer insight into model strengths while slashing the compute bill—a win for both researchers and industry practitioners.
Authors
- Siddharth Joshi
- Haoli Yin
- Rishabh Adiga
- Ricardo Monti
- Aldo Carranza
- Alex Fang
- Alvin Deng
- Amro Abbas
- Brett Larsen
- Cody Blakeney
- Darren Teh
- David Schwab
- Fan Pan
- Haakon Mongstad
- Jack Urbanek
- Jason Lee
- Jason Telanoff
- Josh Wills
- Kaleigh Mentzer
- Luke Merrick
- Parth Doshi
- Paul Burstein
- Pratyush Maini
- Scott Loftin
- Spandan Das
- Tony Jiang
- Vineeth Dorna
- Zhengping Wang
- Bogdan Gaza
- Ari Morcos
- Matthew Leavitt
Paper Information
- arXiv ID: 2601.02316v1
- Categories: cs.LG, cs.AI
- Published: January 5, 2026