[Paper] SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
Source: arXiv - 2604.25855v1
Overview
The paper introduces SIEVES, a framework that lets vision‑language models decide when to answer a question and when to abstain by scoring the quality of the visual evidence they generate. By measuring how well a model localizes the relevant image regions, SIEVES sharply increases the fraction of inputs a system can safely handle (coverage) while keeping the error rate within a strict user‑defined limit, even on out‑of‑distribution (OOD) data.
Key Contributions
- Selective prediction via visual grounding – proposes a confidence estimator that judges the localization quality of a model’s visual evidence rather than relying on raw logits.
- Model‑agnostic selector – the SIEVES selector can be attached to any black‑box reasoner (including proprietary LLMs) without needing internal weights or logits.
- Strong OOD performance – achieves up to 3× higher coverage on five challenging OOD benchmarks (V* Bench, HR‑Bench‑8k, MME‑RealWorld‑Lite, VizWiz, AdVQA) compared with standard confidence‑based baselines.
- Zero‑shot transfer across reasoners – works with diverse visual reasoners (Pixel‑Reasoner, o3, Gemini‑3‑Pro) without any benchmark‑specific fine‑tuning.
- Practical risk control – lets developers set a target risk level (e.g., ≤ 5 % error) and automatically obtain the maximal set of inputs that satisfy it.
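The risk‑control bullet above can be made concrete. A standard recipe in selective prediction (a sketch under that assumption, not code released with the paper) is to sort a labeled validation set by selector score and pick the lowest threshold whose empirical error on the accepted subset stays within the target risk:

```python
def calibrate_threshold(scores, correct, target_risk=0.05):
    """Return the lowest score threshold whose empirical error rate
    on the accepted (score >= threshold) examples is <= target_risk.

    scores  -- selector quality scores on a labeled validation set
    correct -- whether each corresponding answer was correct
    """
    pairs = sorted(zip(scores, correct), reverse=True)  # descending by score
    best = float("inf")  # default: abstain on everything
    errors = 0
    for accepted, (score, ok) in enumerate(pairs, start=1):
        errors += 0 if ok else 1
        if errors / accepted <= target_risk:
            # Accepting everything scored >= score still meets the risk budget;
            # keep lowering the threshold to maximize coverage.
            best = score
    return best
```

With a target risk of 5 %, the returned threshold accepts the largest high‑scoring prefix of the validation set whose error rate is still within budget; everything scoring below it triggers abstention.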
Methodology
- Reasoner produces visual evidence – any multimodal model that can output a heatmap or bounding boxes highlighting the image regions it used to form its answer.
- Evidence Scoring Network (Selector) – a lightweight CNN‑based module trained to predict a quality score for the evidence. The training objective aligns the score with whether the answer is correct, using a small labeled validation set.
- Threshold‑based abstention – at inference time, the selector’s score is compared against a user‑defined threshold that corresponds to the acceptable risk. If the score is below the threshold, the system abstains; otherwise it returns the answer.
- Black‑box compatibility – because the selector only consumes the visual evidence (e.g., heatmaps) and the final answer, it can be plugged into any existing reasoner, even closed‑source APIs.
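The inference‑time pipeline above can be sketched as follows, assuming a hypothetical `reasoner` that returns an answer plus its visual evidence and a `score_evidence` selector (both names are illustrative, not from the paper):

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class SelectiveResult:
    answer: Optional[str]  # None means the system abstained
    score: float           # selector's evidence-quality score


def selective_answer(
    image: Any,
    question: str,
    reasoner: Callable[[Any, str], Tuple[str, Any]],  # -> (answer, evidence)
    score_evidence: Callable[[Any], float],           # selector module
    threshold: float,
) -> SelectiveResult:
    """Answer only when the evidence-quality score clears the threshold."""
    answer, evidence = reasoner(image, question)
    score = score_evidence(evidence)
    if score < threshold:
        return SelectiveResult(answer=None, score=score)  # abstain
    return SelectiveResult(answer=answer, score=score)
```

Note that the selector touches only the evidence (e.g., a heatmap) and never the reasoner's internals, which is what makes the scheme compatible with closed‑source APIs.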
Results & Findings
| Benchmark | Baseline Coverage (at 5 % risk) | SIEVES Coverage | Coverage Gain |
|---|---|---|---|
| V* Bench | 12 % | 35 % | ≈2.9× |
| HR‑Bench‑8k | 18 % | 48 % | ≈2.7× |
| MME‑RealWorld‑Lite | 22 % | 61 % | ≈2.8× |
| VizWiz | 15 % | 44 % | ≈2.9× |
| AdVQA | 20 % | 55 % | ≈2.8× |
- Accuracy remains stable – the abstained predictions are the ones most likely to be wrong, so the overall error rate stays within the target risk.
- Cross‑reasoner gains – attaching SIEVES to o3 and Gemini‑3‑Pro yields coverage improvements of 30‑40 % even though those models already have high raw accuracy.
- No per‑benchmark fine‑tuning – a single selector trained on a modest validation set generalizes to all five OOD datasets.
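The metrics in the table follow the standard selective‑prediction definitions: coverage is the fraction of inputs the system answers, and selective risk is the error rate on that answered subset. A minimal sketch:

```python
def coverage_and_risk(scores, correct, threshold):
    """Compute (coverage, selective risk) at a given abstention threshold.

    scores  -- selector scores for each input
    correct -- whether each answer would have been correct
    """
    accepted = [ok for s, ok in zip(scores, correct) if s >= threshold]
    coverage = len(accepted) / len(scores)
    # Risk is defined only over the answered subset; empty set -> zero risk.
    risk = 0.0 if not accepted else sum(not ok for ok in accepted) / len(accepted)
    return coverage, risk
```

Raising the threshold trades coverage for lower risk; the paper's claim is that evidence‑quality scores make this trade‑off far more favorable than raw‑logit confidence, especially OOD.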
Practical Implications
- Safer deployment in production – developers can expose visual‑question‑answering APIs that automatically refuse to answer when confidence (via evidence quality) is low, reducing costly misclassifications in safety‑critical domains (e.g., medical imaging, autonomous inspection).
- Cost‑effective scaling – by abstaining on uncertain cases, a system can route those inputs to a human reviewer or a more expensive specialist model, optimizing compute budgets.
- Compatibility with closed‑source LLMs – SIEVES can be retro‑fitted to commercial vision‑language services (e.g., Gemini, GPT‑4V) without needing internal model access, making it a plug‑and‑play reliability layer.
- Improved user experience – end‑users receive a clear “I don’t know” response instead of a wrong answer, which is crucial for trust in AI assistants and customer‑support bots.
Limitations & Future Work
- Dependence on explicit visual evidence – models that do not output grounding maps cannot directly benefit from SIEVES; extending the selector to infer implicit evidence is an open challenge.
- Training data for the selector – although the required set is modest, the selector still needs labeled examples for which answer correctness is known; gathering such data for niche domains may be non‑trivial.
- Threshold calibration – selecting the risk threshold can be dataset‑specific; future work could explore adaptive thresholds that auto‑tune based on streaming performance metrics.
- Broader modality coverage – the current study focuses on image‑based VQA; extending the approach to video, 3‑D data, or multimodal reasoning involving audio remains to be explored.
Authors
- Hector G. Rodriguez
- Marcus Rohrbach
Paper Information
- arXiv ID: 2604.25855v1
- Categories: cs.CV, cs.AI
- Published: April 28, 2026