[Paper] Multimodal Large Language Models as Image Classifiers
Source: arXiv - 2603.06578v1
Overview
The paper “Multimodal Large Language Models as Image Classifiers” investigates why recent studies report wildly different classification results for multimodal large language models (MLLMs). The authors show that most of the discrepancy comes from flawed evaluation protocols and noisy ground‑truth labels rather than from the models themselves. By fixing these issues, they reveal that MLLMs are far closer to supervised vision models than previously thought and can even help human annotators curate large image datasets.
Key Contributions
- Systematic audit of evaluation protocols – identifies three common sources of bias: (1) discarding model outputs that fall outside the predefined class list, (2) using overly easy multiple‑choice distractors, and (3) poor mapping of open‑world predictions to class IDs.
- Quantitative analysis of “hidden” design choices – demonstrates that batch size, image ordering, and the choice of text encoder can swing accuracy by several percentage points.
- ReGT dataset – introduces a multilabel re‑annotation of 625 ImageNet‑1k classes (ReGT) that corrects noisy labels and provides a more reliable benchmark for MLLMs.
- Performance gap shrinkage – shows that with ReGT, MLLMs gain up to +10.8 percentage points of accuracy, narrowing the gap to fully supervised vision models.
- Human‑in‑the‑loop study – shows that annotators accepted or incorporated MLLM predictions in roughly 50 % of difficult cases, highlighting the models’ utility for large‑scale dataset curation.
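The first and third sources of bias can be illustrated with a minimal sketch. The class list, `naive_score`, and `mapped_score` below are hypothetical, and string similarity stands in for whatever semantic matcher a real pipeline would use to map open‑world predictions back to class IDs:

```python
import difflib

CLASS_NAMES = ["tabby cat", "golden retriever", "sports car"]  # toy class list

def naive_score(prediction: str, label: str) -> bool:
    # Flawed protocol: any output not verbatim in the class list is
    # discarded (counted wrong), even when it is a reasonable paraphrase.
    return prediction in CLASS_NAMES and prediction == label

def mapped_score(prediction: str, label: str) -> bool:
    # Fixed protocol (sketch): map a free-form prediction to the closest
    # class name before scoring, instead of discarding it outright.
    matches = difflib.get_close_matches(prediction.lower(), CLASS_NAMES,
                                        n=1, cutoff=0.6)
    return bool(matches) and matches[0] == label

# "tabby-cat" is a near-miss paraphrase of the correct class.
print(naive_score("tabby-cat", "tabby cat"))   # False: discarded as out-of-list
print(mapped_score("tabby-cat", "tabby cat"))  # True: mapped, then scored
```

The gap between the two scoring rules on such near-misses is exactly the kind of artefact the audit attributes to the evaluation protocol rather than the model.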
Methodology
- Benchmark audit – The authors re‑implemented the most common classification pipelines for MLLMs (zero‑shot prompting, multiple‑choice, and open‑world settings) and measured how each step influences the final score.
- Error‑type analysis – They categorized failures into “out‑of‑list” predictions, “weak distractor” choices, and “mapping errors” to pinpoint where performance was artificially inflated or deflated.
- Design‑choice experiments – By varying batch size (1–64), shuffling image order, and swapping the text encoder (e.g., CLIP‑text vs. LLaMA‑based), they recorded the resulting accuracy changes.
- ReGT creation – A team of experts re‑labeled a subset of ImageNet‑1k, allowing multiple correct labels per image (multilabel). This dataset serves as a cleaner ground truth for evaluation.
- Human‑MLLM annotation study – In a controlled experiment, annotators were shown images together with the model’s top‑k predictions; they either accepted the suggestion, edited it, or rejected it. The acceptance rate was recorded.
All steps are described with enough detail that a practitioner could reproduce the experiments using publicly available MLLM checkpoints (e.g., LLaVA, MiniGPT‑4) and the released ReGT annotations.
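The multilabel scoring that ReGT enables can be sketched as follows; the toy data and function names are illustrative, not the paper’s released code:

```python
def accuracy_single(preds, gold):
    # Original ImageNet-style scoring: exactly one gold label per image.
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def accuracy_multilabel(preds, gold_sets):
    # ReGT-style scoring (sketch): an image may carry several valid labels,
    # and a prediction counts as correct if it matches any of them.
    return sum(p in s for p, s in zip(preds, gold_sets)) / len(gold_sets)

# Toy example: the second image legitimately shows both a desk and a laptop,
# but the noisy single-label ground truth picked only "laptop".
preds       = ["tabby cat", "desk"]
single_gold = ["tabby cat", "laptop"]
regt_gold   = [{"tabby cat"}, {"laptop", "desk"}]

print(accuracy_single(preds, single_gold))    # 0.5
print(accuracy_multilabel(preds, regt_gold))  # 1.0
```

Under the noisy single label the model is penalized for a correct answer; under the multilabel re‑annotation it is not, which is the mechanism behind the accuracy gains reported with ReGT.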
Results & Findings
| Setting | Baseline accuracy (ImageNet‑1k) | Accuracy after protocol fixes | Result with ReGT |
|---|---|---|---|
| Zero‑shot multiple‑choice (original) | 38.2 % | 44.7 % (↑6.5 pp) | – |
| Open‑world mapping (original) | 31.5 % | 40.9 % (↑9.4 pp) | – |
| Zero‑shot with ReGT | – | – | +10.8 pp (up to 55 % total) |
| Human‑MLLM assisted labeling | – | – | ≈50 % of difficult cases accepted the model’s suggestion |
Key takeaways
- Protocol fixes alone recover 6–9 pp of accuracy, proving that many “failures” were evaluation artefacts.
- Cleaner labels (ReGT) provide the biggest boost, confirming that noisy ImageNet labels penalize MLLMs more than fully supervised CNNs.
- Design choices matter: larger batch sizes and stable image ordering improve consistency; the choice of text encoder can swing results by up to 3 pp.
- MLLMs are practical annotators: in half of the hard cases, the model’s suggestion was good enough for a human to adopt it without further work.
Practical Implications
- More reliable benchmarking – Teams building vision‑language products can adopt the authors’ corrected protocols (e.g., keep out‑of‑list predictions, use stronger distractors) to get a true sense of model capability.
- Dataset curation at scale – Companies that need to label millions of images (e.g., e‑commerce, social media) can integrate MLLMs into their annotation pipelines, cutting human effort roughly in half for ambiguous items.
- Model selection guidance – When choosing an MLLM for classification, prioritize models that rely less on supervised vision pre‑training (they benefit most from clean labels).
- Fine‑tuning strategies – The sensitivity to batch size and image order suggests that even lightweight fine‑tuning or prompt‑engineering can yield noticeable gains without full retraining.
- Open‑world applications – By improving the mapping from free‑form model output to a target taxonomy, developers can build more flexible image‑search or content‑moderation systems that handle novel classes gracefully.
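The “use stronger distractors” fix can be sketched in a few lines. The class names and function names here are hypothetical, and string similarity is only a stand‑in for the semantic or visual similarity a real pipeline would rank by:

```python
import difflib
import random

CLASS_NAMES = ["tabby cat", "tiger cat", "sports car", "airliner", "banana"]

def weak_distractors(target: str, k: int = 2, seed: int = 0) -> list[str]:
    # Weak protocol: sample distractors uniformly at random, which often
    # yields choices trivially distinguishable from the target.
    rng = random.Random(seed)
    return rng.sample([c for c in CLASS_NAMES if c != target], k)

def hard_distractors(target: str, k: int = 2) -> list[str]:
    # Stronger protocol (sketch): pick the classes most similar to the
    # target, so the multiple-choice question actually tests fine-grained
    # discrimination rather than coarse category recognition.
    others = [c for c in CLASS_NAMES if c != target]
    ranked = sorted(others,
                    key=lambda c: difflib.SequenceMatcher(None, target, c).ratio(),
                    reverse=True)
    return ranked[:k]

# Confusable neighbours such as "tiger cat" rank first for "tabby cat".
print(hard_distractors("tabby cat"))
```

With easy random distractors a model can score well without fine‑grained recognition, which is one way the original multiple‑choice numbers were inflated.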
Limitations & Future Work
- Scope of ReGT – The re‑annotation covers only 625 of the 1,000 ImageNet classes; extending this to the full set (or to other domains) would further validate the findings.
- Model diversity – Experiments focus on a handful of publicly released MLLMs; newer or larger multimodal models may exhibit different sensitivities.
- Human study size – The annotation experiment involved a limited number of annotators and image samples; larger user studies are needed to confirm the 50 % acceptance rate in production settings.
- Real‑time constraints – The impact of batch size and ordering on latency was not explored; future work should balance accuracy gains against inference speed for deployment.
By addressing these points, the community can solidify MLLMs as dependable image classifiers and annotation assistants in real‑world pipelines.
Authors
- Nikita Kisel
- Illia Volkov
- Klara Janouskova
- Jiri Matas
Paper Information
- arXiv ID: 2603.06578v1
- Categories: cs.CV
- Published: March 6, 2026