[Paper] SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
Source: arXiv - 2604.25855v1
Overview
The paper introduces SIEVES, a framework that lets vision‑language models decide when to answer a question and when to abstain by scoring the quality of the visual evidence they generate. By measuring how well a model localizes the relevant image regions, SIEVES sharply increases the fraction of inputs a system can safely handle (coverage) while keeping the error rate within a strict user‑defined limit, even on out‑of‑distribution (OOD) data.
Key Contributions
- Selective prediction via visual grounding – proposes a confidence estimator that judges the localization quality of a model’s visual evidence rather than relying on raw logits.
- Model‑agnostic selector – the SIEVES selector can be attached to any black‑box reasoner (including proprietary LLMs) without needing internal weights or logits.
- Strong OOD performance – achieves up to 3× higher coverage on five challenging OOD benchmarks (V* Bench, HR‑Bench‑8k, MME‑RealWorld‑Lite, VizWiz, AdVQA) compared with standard confidence‑based baselines.
- Zero‑shot transfer across reasoners – works with diverse visual reasoners (Pixel‑Reasoner, o3, Gemini‑3‑Pro) without any benchmark‑specific fine‑tuning.
- Practical risk control – lets developers set a target risk level (e.g., ≤ 5 % error) and automatically obtain the maximal set of inputs that satisfy it.
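The risk‑control bullet above can be made concrete. A standard recipe in selective prediction (a sketch under that assumption, not code released with the paper) is to sort a labeled validation set by selector score and pick the lowest threshold whose empirical error on the accepted subset stays within the target risk:

```python
def calibrate_threshold(scores, correct, target_risk=0.05):
    """Return the lowest score threshold whose empirical error rate
    on the accepted (score >= threshold) examples is <= target_risk.

    scores  -- selector quality scores on a labeled validation set
    correct -- whether each corresponding answer was correct
    """
    pairs = sorted(zip(scores, correct), reverse=True)  # descending by score
    best = float("inf")  # default: abstain on everything
    errors = 0
    for accepted, (score, ok) in enumerate(pairs, start=1):
        errors += 0 if ok else 1
        if errors / accepted <= target_risk:
            # Accepting everything scored >= score still meets the risk budget;
            # keep lowering the threshold to maximize coverage.
            best = score
    return best
```

With a target risk of 5 %, the returned threshold accepts the largest high‑scoring prefix of the validation set whose error rate is still within budget; everything scoring below it triggers abstention.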
Methodology
- Reasoner produces visual evidence – any multimodal model that can output a heatmap or bounding boxes highlighting the image regions it used to form its answer.
- Evidence Scoring Network (Selector) – a lightweight CNN‑based module trained to predict a quality score for the evidence. The training objective aligns the score with whether the answer is correct, using a small labeled validation set.
- Threshold‑based abstention – at inference time, the selector’s score is compared against a user‑defined threshold that corresponds to the acceptable risk. If the score is below the threshold, the system abstains; otherwise it returns the answer.
- Black‑box compatibility – because the selector only consumes the visual evidence (e.g., heatmaps) and the final answer, it can be plugged into any existing reasoner, even closed‑source APIs.
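The inference‑time pipeline above can be sketched as follows, assuming a hypothetical `reasoner` that returns an answer plus its visual evidence and a `score_evidence` selector (both names are illustrative, not from the paper):

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class SelectiveResult:
    answer: Optional[str]  # None means the system abstained
    score: float           # selector's evidence-quality score


def selective_answer(
    image: Any,
    question: str,
    reasoner: Callable[[Any, str], Tuple[str, Any]],  # -> (answer, evidence)
    score_evidence: Callable[[Any], float],           # selector module
    threshold: float,
) -> SelectiveResult:
    """Answer only when the evidence-quality score clears the threshold."""
    answer, evidence = reasoner(image, question)
    score = score_evidence(evidence)
    if score < threshold:
        return SelectiveResult(answer=None, score=score)  # abstain
    return SelectiveResult(answer=answer, score=score)
```

Note that the selector touches only the evidence (e.g., a heatmap) and never the reasoner's internals, which is what makes the scheme compatible with closed‑source APIs.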
Results & Findings
| Benchmark | Baseline Coverage (at 5 % risk) | SIEVES Coverage | Coverage Gain |
|---|---|---|---|
| V* Bench | 12 % | 35 % | ≈2.9× |
| HR‑Bench‑8k | 18 % | 48 % | ≈2.7× |
| MME‑RealWorld‑Lite | 22 % | 61 % | ≈2.8× |
| VizWiz | 15 % | 44 % | ≈2.9× |
| AdVQA | 20 % | 55 % | ≈2.8× |
- Accuracy remains stable – the abstained predictions are the ones most likely to be wrong, so the overall error rate stays within the target risk.
- Cross‑reasoner gains – attaching SIEVES to o3 and Gemini‑3‑Pro yields coverage improvements of 30‑40 % even though those models already have high raw accuracy.
- No per‑benchmark fine‑tuning – a single selector trained on a modest validation set generalizes to all five OOD datasets.
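The metrics in the table follow the standard selective‑prediction definitions: coverage is the fraction of inputs the system answers, and selective risk is the error rate on that answered subset. A minimal sketch:

```python
def coverage_and_risk(scores, correct, threshold):
    """Compute (coverage, selective risk) at a given abstention threshold.

    scores  -- selector scores for each input
    correct -- whether each answer would have been correct
    """
    accepted = [ok for s, ok in zip(scores, correct) if s >= threshold]
    coverage = len(accepted) / len(scores)
    # Risk is defined only over the answered subset; empty set -> zero risk.
    risk = 0.0 if not accepted else sum(not ok for ok in accepted) / len(accepted)
    return coverage, risk
```

Raising the threshold trades coverage for lower risk; the paper's claim is that evidence‑quality scores make this trade‑off far more favorable than raw‑logit confidence, especially OOD.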
Practical Implications
- Safer deployment in production – developers can expose visual‑question‑answering APIs that automatically refuse to answer when confidence (via evidence quality) is low, reducing costly misclassifications in safety‑critical domains (e.g., medical imaging, autonomous inspection).
- Cost‑effective scaling – by abstaining on uncertain cases, a system can route those inputs to a human reviewer or a more expensive specialist model, optimizing compute budgets.
- Compatibility with closed‑source LLMs – SIEVES can be retro‑fitted to commercial vision‑language services (e.g., Gemini, GPT‑4V) without needing internal model access, making it a plug‑and‑play reliability layer.
- Improved user experience – end‑users receive a clear “I don’t know” response instead of a wrong answer, which is crucial for trust in AI assistants and customer‑support bots.
Limitations & Future Work
- Dependence on explicit visual evidence – models that do not output grounding maps cannot directly benefit from SIEVES; extending the selector to infer implicit evidence is an open challenge.
- Training data for the selector – although the required set is modest, the selector still needs labeled examples for which answer correctness is known; gathering such data for niche domains may be non‑trivial.
- Threshold calibration – selecting the risk threshold can be dataset‑specific; future work could explore adaptive thresholds that auto‑tune based on streaming performance metrics.
- Broader modality coverage – the current study focuses on image‑based VQA; extending the approach to video, 3‑D data, or multimodal reasoning involving audio remains to be explored.
Authors
- Hector G. Rodriguez
- Marcus Rohrbach
Paper Information
- arXiv ID: 2604.25855v1
- Categories: cs.CV, cs.AI
- Published: April 28, 2026