[Paper] SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

Published: April 28, 2026 at 12:57 PM EDT
4 min read
Source: arXiv

Overview

The paper introduces SIEVES, a framework that lets vision‑language models decide when to answer a question and when to “pass” by scoring the quality of the visual evidence they generate. By checking how well a model localizes the image regions relevant to a question, SIEVES sharply increases the fraction of inputs a system can safely handle (coverage) while keeping error rates within strict user‑defined limits, even on out‑of‑distribution (OOD) data.

Key Contributions

  • Selective prediction via visual grounding – proposes a confidence estimator that judges the localization quality of a model’s visual evidence rather than relying on raw logits.
  • Model‑agnostic selector – the SIEVES selector can be attached to any black‑box reasoner (including proprietary LLMs) without needing internal weights or logits.
  • Strong OOD performance – achieves up to 3× higher coverage on five challenging OOD benchmarks (V* Bench, HR‑Bench‑8k, MME‑RealWorld‑Lite, VizWiz, AdVQA) compared with standard confidence‑based baselines.
  • Zero‑shot transfer across reasoners – works with diverse visual reasoners (Pixel‑Reasoner, o3, Gemini‑3‑Pro) without any benchmark‑specific fine‑tuning.
  • Practical risk control – lets developers set a target risk level (e.g., ≤ 5 % error) and automatically obtain the maximal set of inputs that satisfy it.
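The risk‑control bullet above can be sketched as a simple threshold search: given selector scores and correctness labels on a held‑out validation set, choose the lowest (most permissive) threshold whose accepted set stays within the target error rate. This is a minimal illustration; `calibrate_threshold` and its inputs are hypothetical placeholders, not the paper's exact procedure:

```python
def calibrate_threshold(scores, correct, max_risk=0.05):
    """Return the lowest selector threshold whose accepted set
    keeps empirical error within max_risk, or None if impossible.

    scores  -- selector scores on a labeled validation set
    correct -- 1 if the reasoner's answer was right, else 0
    """
    # Scan candidate thresholds from most permissive upward,
    # so the first one that satisfies the risk bound maximizes coverage.
    for t in sorted(set(scores)):
        accepted = [c for s, c in zip(scores, correct) if s >= t]
        if accepted:
            risk = 1 - sum(accepted) / len(accepted)
            if risk <= max_risk:
                return t
    return None  # no threshold meets the target risk
```

Scanning from the lowest score upward guarantees the returned threshold accepts the largest possible set of inputs while respecting the bound.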

Methodology

  1. Reasoner produces visual evidence – any multimodal model that can output heatmaps or bounding boxes highlighting the image regions it used for its answer.
  2. Evidence Scoring Network (Selector) – a lightweight CNN‑based module trained to predict a quality score for the evidence. The training objective aligns the score with whether the answer is correct, using a small labeled validation set.
  3. Threshold‑based abstention – at inference time, the selector’s score is compared against a user‑defined threshold that corresponds to the acceptable risk. If the score is below the threshold, the system abstains; otherwise it returns the answer.
  4. Black‑box compatibility – because the selector only consumes the visual evidence (e.g., heatmaps) and the final answer, it can be plugged into any existing reasoner, even closed‑source APIs.
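The four steps above can be sketched end to end, assuming a black‑box `reasoner` that returns an answer together with its visual evidence and a `selector` that scores that evidence (both names are hypothetical placeholders for whatever models a deployment uses):

```python
THRESHOLD = 0.7  # calibrated on a small labeled validation set


def answer_or_abstain(reasoner, selector, image, question,
                      threshold=THRESHOLD):
    """Return the reasoner's answer, or None to abstain.

    The selector never sees the reasoner's internals -- only its
    visual evidence and final answer -- so any black-box API works.
    """
    answer, evidence = reasoner(image, question)   # step 1: black-box call
    score = selector(evidence, answer)             # step 2: evidence quality
    if score < threshold:                          # step 3: abstain if weak
        return None
    return answer                                  # confident enough to answer
```

Because the selector consumes only the evidence and the answer, swapping the reasoner for a closed‑source API requires no changes to the abstention logic (step 4).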

Results & Findings

| Benchmark | Baseline Coverage (at 5 % risk) | SIEVES Coverage | Relative Gain |
| --- | --- | --- | --- |
| V* Bench | 12 % | 35 % | +3× |
| HR‑Bench‑8k | 18 % | 48 % | +2.7× |
| MME‑RealWorld‑Lite | 22 % | 61 % | +2.8× |
| VizWiz | 15 % | 44 % | +2.9× |
| AdVQA | 20 % | 55 % | +2.8× |
  • Accuracy remains stable – the abstained predictions are the ones most likely to be wrong, so the overall error rate stays within the target risk.
  • Cross‑reasoner gains – attaching SIEVES to o3 and Gemini‑3‑Pro yields coverage improvements of 30‑40 % even though those models already have high raw accuracy.
  • No per‑benchmark fine‑tuning – a single selector trained on a modest validation set generalizes to all five OOD datasets.
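The coverage and risk figures in the table above follow the standard selective‑prediction definitions: coverage is the fraction of inputs accepted at a threshold, and selective risk is the error rate among accepted inputs. A minimal sketch (helper name hypothetical):

```python
def risk_coverage(scores, correct, threshold):
    """Empirical coverage and selective risk at a given threshold.

    scores  -- selector scores, one per input
    correct -- 1 if the answer on that input was right, else 0
    """
    accepted = [c for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(accepted) / len(scores)          # fraction answered
    risk = (1 - sum(accepted) / len(accepted)) if accepted else 0.0
    return coverage, risk
```

Sweeping the threshold over all scores traces the full risk‑coverage curve for a benchmark.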

Practical Implications

  • Safer deployment in production – developers can expose visual‑question‑answering APIs that automatically refuse to answer when confidence (via evidence quality) is low, reducing costly misclassifications in safety‑critical domains (e.g., medical imaging, autonomous inspection).
  • Cost‑effective scaling – by abstaining on uncertain cases, a system can route those inputs to a human reviewer or a more expensive specialist model, optimizing compute budgets.
  • Compatibility with closed‑source LLMs – SIEVES can be retro‑fitted to commercial vision‑language services (e.g., Gemini, GPT‑4V) without needing internal model access, making it a plug‑and‑play reliability layer.
  • Improved user experience – end‑users receive a clear “I don’t know” response instead of a wrong answer, which is crucial for trust in AI assistants and customer‑support bots.
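The routing idea above can be sketched as a two‑tier pipeline: answer directly when the evidence score clears the threshold, otherwise escalate to a fallback such as a human reviewer queue or a more expensive specialist model. All names here are hypothetical illustrations, not an API from the paper:

```python
def route(reasoner, selector, image, question, threshold, fallback):
    """Answer directly when evidence is strong; otherwise escalate.

    Returns (answer, source), where source records which tier answered.
    """
    answer, evidence = reasoner(image, question)
    if selector(evidence, answer) >= threshold:
        return answer, "model"          # cheap path: trust the reasoner
    # Weak evidence: hand off to a human reviewer or specialist model.
    return fallback(image, question), "fallback"
```

Tracking the `source` tag also gives a running estimate of abstention rate, useful for monitoring compute and review costs.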

Limitations & Future Work

  • Dependence on explicit visual evidence – models that do not output grounding maps cannot directly benefit from SIEVES; extending the selector to infer implicit evidence is an open challenge.
  • Training data for the selector – while modest, the selector still requires a labeled set where correct/incorrect answers are known; gathering such data for niche domains may be non‑trivial.
  • Threshold calibration – selecting the risk threshold can be dataset‑specific; future work could explore adaptive thresholds that auto‑tune based on streaming performance metrics.
  • Broader modality coverage – the current study focuses on image‑based VQA; extending the approach to video, 3‑D data, or multimodal reasoning involving audio remains to be explored.

Authors

  • Hector G. Rodriguez
  • Marcus Rohrbach

Paper Information

  • arXiv ID: 2604.25855v1
  • Categories: cs.CV, cs.AI
  • Published: April 28, 2026
