[Paper] DatBench: Discriminative, Faithful, and Efficient VLM Evaluations
Source: arXiv - 2601.02316v1
Overview
The paper “DatBench: Discriminative, Faithful, and Efficient VLM Evaluations” tackles a surprisingly sticky problem in the booming field of vision‑language models (VLMs): how do we reliably measure how good these models really are? The authors argue that many popular benchmarks are misleading, wasteful, or both, and they introduce a revamped evaluation suite, DatBench, that is more faithful to real‑world use, better at separating models of different quality, and dramatically cheaper to run.
Key Contributions
- Three evaluation desiderata – faithfulness, discriminability, and efficiency – proposed as the criteria a VLM benchmark should satisfy.
- Systematic audit of existing VLM tests, exposing three major failure modes: (i) multiple‑choice formats that encourage guessing, (ii) “blindly solvable” items that don’t need the image, and (iii) mislabeled or ambiguous samples.
- Data‑centric remediation pipeline that (a) converts multiple‑choice questions into generative prompts, (b) filters out blind‑solve and noisy examples, (c) curates a clean, high‑quality subset.
- DatBench‑Full – a comprehensive suite of 33 datasets covering nine VLM capabilities (e.g., object grounding, visual reasoning, captioning).
- DatBench (compact) – a distilled, discriminative subset that delivers up to 50× speed‑up (average 13×) while preserving the ability to separate models of different quality.
- Empirical evidence that the cleaned benchmarks reveal capability gaps of up to 35 % that were hidden in the original tests.
Methodology
- Failure‑Mode Diagnosis – Quantified how many items in popular VLM benchmarks could be answered correctly without looking at the image (up to 70 % in some cases) and measured label noise (up to 42 %).
- Transformation – Re‑phrased multiple‑choice questions as open‑ended generation tasks (e.g., “What is shown in the image?”) so models can’t rely on answer‑option elimination.
- Filtering – Used a lightweight “blind‑solver” (a language‑only model) to flag and remove items solvable without visual input; human verification then caught ambiguous or mislabeled cases. A minimal sketch of the transformation and filtering steps appears after this list.
- Benchmark Assembly – Grouped cleaned items into nine capability buckets (e.g., VQA, visual entailment, region grounding). Released two versions: a full, exhaustive set and a compact, high‑discriminability subset selected via greedy optimization for maximal model separation per compute unit (a simplified selection sketch follows this list).
- Evaluation Protocol – Ran standard VLMs (e.g., CLIP‑based, Flamingo, LLaVA) on both the original and DatBench versions, recording performance drops, compute time, and discriminative scores (e.g., pairwise rank correlation).
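A minimal sketch of the transformation and filtering steps, in Python. Everything here is hypothetical rather than the authors' released pipeline: `Item`, `to_generative`, `filter_blind_solvable`, and `dummy_solver` are made-up names, and in practice the blind solver would be a call to a language‑only model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Item:
    question: str
    choices: Optional[List[str]]  # None once converted to open-ended
    answer: str
    image_path: str

def to_generative(item: Item) -> Item:
    """Drop the answer options so a model must generate the answer
    instead of eliminating choices (the MCQ -> generative conversion)."""
    return Item(item.question, None, item.answer, item.image_path)

def filter_blind_solvable(items: List[Item],
                          blind_solver: Callable[[str], str]) -> List[Item]:
    """Keep only items a language-only model cannot answer from the
    question text alone; `blind_solver` returns its best guess given
    just the question string (no image)."""
    return [it for it in items
            if blind_solver(it.question).strip().lower()
            != it.answer.strip().lower()]

def dummy_solver(question: str) -> str:
    """Stand-in for a language-only LLM call; purely illustrative."""
    return "blue" if "sky" in question.lower() else ""

items = [
    Item("What color is the sky?", ["blue", "red"], "blue", "img1.jpg"),
    Item("What is the person holding?", ["cup", "phone"], "cup", "img2.jpg"),
]
cleaned = filter_blind_solvable([to_generative(it) for it in items], dummy_solver)
blind_solve_rate = 1 - len(cleaned) / len(items)  # the diagnosis statistic
# Only the second item survives: the first is answerable without the image.
```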
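The compact-subset construction can likewise be sketched as greedy selection. The snippet below is a simplified stand-in for the paper's optimization, under assumed inputs: given per-item scores from a pool of reference models, it repeatedly adds the item that best preserves the full-benchmark model ranking (measured by Spearman rank correlation) until a fixed budget is reached; the authors' actual objective and constraints may differ.

```python
import numpy as np
from scipy.stats import spearmanr

def greedy_discriminative_subset(scores: np.ndarray, budget: int) -> list:
    """scores: (num_models, num_items) matrix of per-item correctness for a
    pool of reference models. Greedily add the item whose inclusion best
    preserves the full-benchmark model ranking, up to `budget` items."""
    full_means = scores.mean(axis=1)  # full-benchmark score per model
    selected, remaining = [], set(range(scores.shape[1]))
    while len(selected) < budget and remaining:
        best_item, best_rho = None, -np.inf
        for j in remaining:
            subset_means = scores[:, selected + [j]].mean(axis=1)
            rho, _ = spearmanr(subset_means, full_means)
            rho = -np.inf if np.isnan(rho) else rho
            if best_item is None or rho > best_rho:
                best_item, best_rho = j, rho
        selected.append(best_item)
        remaining.remove(best_item)
    return selected

# Usage: 5 reference models x 200 items, keep a 20-item compact subset.
rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(5, 200)).astype(float)
subset = greedy_discriminative_subset(scores, budget=20)
rho, _ = spearmanr(scores[:, subset].mean(axis=1), scores.mean(axis=1))
print(f"Spearman rank correlation, compact vs. full: {rho:.2f}")
```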
Results & Findings
| Aspect | Original Benchmarks | DatBench‑Full | DatBench (compact) |
|---|---|---|---|
| Average accuracy drop (after conversion to generative) | – | ‑35 % (max) | ‑30 % (typical) |
| Blind‑solve rate | Up to 70 % | < 5 % | < 5 % |
| Label‑noise rate | Up to 42 % | < 2 % | < 2 % |
| Compute cost (GPU‑hours per model) | 1× (baseline) | 1× (same) | 0.07× (≈13× faster) |
| Discriminability (Spearman rank correlation across models) | 0.62 | 0.78 | 0.75 |
What this means: When the same VLMs are evaluated on the cleaned, generative version, their scores fall sharply, exposing hidden weaknesses. At the same time, the compact DatBench keeps almost the same ordering of models while slashing evaluation time dramatically.
Practical Implications
- R&D pipelines become leaner – Teams can now run a full VLM evaluation suite in a fraction of the time, freeing up compute for model training and iteration.
- More trustworthy model selection – Benchmarks faithful to real‑world tasks (no guessing, no image‑free shortcuts) let product engineers trust that a high score translates to downstream performance (e.g., e‑commerce visual search or AI‑assisted design tools).
- Benchmark‑driven product roadmaps – The nine capability categories map cleanly to common application domains (captioning, visual QA, grounding). Companies can prioritize improvements where DatBench shows the biggest gaps.
- Open‑source community standard – By releasing both the full and compact versions, the authors provide a drop‑in replacement for widely used VLM testbeds, encouraging reproducibility and fair competition.
- Cost savings at scale – For large labs evaluating dozens of model variants, a 13× speed‑up translates to millions of dollars saved annually in GPU compute.
Limitations & Future Work
- Scope of modalities – DatBench focuses on static images paired with text; video‑language or multimodal audio‑visual tasks are not covered.
- Human verification bottleneck – While the blind‑solver filter is automated, cleaning ambiguous labels still requires manual effort, which may not scale to new datasets without additional tooling.
- Generative evaluation metrics – Converting to open‑ended generation makes scoring depend on text‑overlap metrics (e.g., BLEU, ROUGE) that can be noisy; more robust similarity measures (e.g., CLIPScore) could be explored. The toy scorer after this list illustrates the problem.
- Dynamic benchmark evolution – As VLMs become capable of reasoning beyond the curated datasets, future work should investigate adversarial or out‑of‑distribution test cases to keep the evaluation challenging.
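To make the metric concern concrete, here is a plain token-overlap F1 scorer (a common open-ended QA heuristic, not the paper's metric): a correct paraphrase can score near zero simply because it shares few tokens with the reference answer.

```python
import re
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and the reference.
    Lowercase word tokenization only; no synonym or paraphrase handling."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("a man riding a bicycle", "person on a bike"))  # ~0.22, though a fair paraphrase
print(token_f1("person on a bike", "person on a bike"))        # 1.0
```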
Bottom line: DatBench offers a pragmatic, data‑centric answer to the growing pains of VLM evaluation, delivering clearer insight into model strengths while slashing the compute bill—a win for both researchers and industry practitioners.
Authors
- Siddharth Joshi
- Haoli Yin
- Rishabh Adiga
- Ricardo Monti
- Aldo Carranza
- Alex Fang
- Alvin Deng
- Amro Abbas
- Brett Larsen
- Cody Blakeney
- Darren Teh
- David Schwab
- Fan Pan
- Haakon Mongstad
- Jack Urbanek
- Jason Lee
- Jason Telanoff
- Josh Wills
- Kaleigh Mentzer
- Luke Merrick
- Parth Doshi
- Paul Burstein
- Pratyush Maini
- Scott Loftin
- Spandan Das
- Tony Jiang
- Vineeth Dorna
- Zhengping Wang
- Bogdan Gaza
- Ari Morcos
- Matthew Leavitt
Paper Information
- arXiv ID: 2601.02316v1
- Categories: cs.LG, cs.AI
- Published: January 5, 2026