[Paper] Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
Source: arXiv - 2602.24195v1
Overview
Multimodal Large Language Models (MLLMs) can answer questions that involve text, images, audio, or video, but they sometimes generate confident‑looking yet wrong answers. The paper “Uncertainty Quantification for Multimodal Large Language Models with Incoherence‑adjusted Semantic Volume” proposes UMPIRE, a lightweight, training‑free method that lets developers gauge how much trust to place in an MLLM’s response across any modality.
Key Contributions
- UMPIRE framework: a unified uncertainty estimator that works for text, image, audio, and video outputs without needing extra tools or fine‑tuning.
- Incoherence‑adjusted semantic volume: a novel metric that combines (i) the semantic spread of multiple sampled responses with (ii) the model's internal incoherence (how unconfident it is in each individual sample) to produce a single uncertainty score.
- Formal desiderata & theory: the authors define what a good uncertainty measure should satisfy for multimodal models and provide theoretical justification for their design.
- Broad empirical validation: experiments on diverse benchmarks (image‑question answering, audio captioning, video‑text retrieval, and generative tasks) show UMPIRE outperforms existing baselines in error detection and calibration, even under adversarial or out‑of‑distribution conditions.
- Zero‑training, low‑overhead: UMPIRE runs at inference time using only the model’s internal representations, making it practical for production pipelines.
Methodology
- Sample multiple outputs – For a given input (e.g., an image), the MLLM is prompted to generate k candidate responses (text, image, audio, etc.).
- Extract internal modality features – The model’s hidden states that correspond to each modality are harvested directly from the forward pass (no external encoders).
- Compute semantic volume – The sampled responses are embedded in a shared semantic space; the volume of the convex hull (or a proxy such as pairwise cosine distances) captures how diverse the answers are globally.
- Adjust for incoherence – Each sample’s internal confidence score (e.g., log‑probability of the token sequence or modality‑specific logits) is used to weight the volume, penalizing clusters of low‑confidence answers.
- Aggregate into a single uncertainty score – The final UMPIRE score is high when responses are both diverse and individually low‑confidence, signalling that the model is unsure about the task.
Because all steps rely on the model’s own forward pass, UMPIRE adds only a modest compute cost (typically a few extra forward passes for sampling).
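The steps above can be sketched in a few lines. This is an illustrative approximation, not the paper's exact formulation: the function name `umpire_score`, the mean pairwise cosine distance as a volume proxy, and the simple product used to combine diversity with incoherence are all assumptions made here for clarity.

```python
import numpy as np

def umpire_score(embeddings: np.ndarray, log_probs: np.ndarray) -> float:
    """Sketch of an incoherence-adjusted semantic-volume score.

    embeddings: (k, d) array of the k sampled responses embedded in a
                shared semantic space.
    log_probs:  (k,) per-response sequence log-probabilities from the
                model's own forward pass.
    """
    # Normalise embeddings so cosine similarity is well defined.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Semantic "volume" proxy: mean pairwise cosine distance over all
    # response pairs (higher = more semantically diverse answers).
    sims = emb @ emb.T
    upper = np.triu_indices(len(emb), k=1)
    volume = float(np.mean(1.0 - sims[upper]))

    # Incoherence adjustment: average per-sample confidence in (0, 1];
    # low confidence (high incoherence) inflates the final score.
    confidence = float(np.mean(np.exp(log_probs)))
    return volume * (1.0 - confidence)
```

Under this sketch, the score is highest exactly when the sampled answers disagree semantically *and* the model assigns them low probability, matching the behaviour the paper describes.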
Results & Findings
| Benchmark | Modality | Baseline (e.g., entropy, MC‑Dropout) | UMPIRE | Change |
|---|---|---|---|---|
| VQA‑2 (image‑text QA) | Text answer | 71.2 % AUC | 78.9 % AUC | +7.7 pts |
| AudioCaps (audio captioning) | Text answer | 0.62 ECE | 0.44 ECE | −0.18 |
| MSRVTT‑QA (video‑text QA) | Text answer | 68.5 % AUC | 75.3 % AUC | +6.8 pts |
| Text‑to‑image generation (Stable Diffusion) | Image output | 0.71 failure detection | 0.85 | +0.14 |
| Adversarial OOD (perturbed images) | All | 0.58 calibration error | 0.39 | −0.19 (↓33 %) |
- Error detection: UMPIRE consistently ranks truly erroneous outputs higher than baseline uncertainty metrics, making it reliable for triaging.
- Calibration: Predicted uncertainty aligns better with actual error rates, which is crucial for downstream decision‑making.
- Cross‑modal generalization: The same pipeline works for generation tasks (e.g., image synthesis) without any redesign.
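The ECE figures in the table refer to expected calibration error, a standard metric that bins predictions by confidence and averages the gap between confidence and accuracy within each bin. A minimal implementation of the standard formulation (not specific to this paper) looks like:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then take the
    bin-size-weighted average of |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece
```

A perfectly calibrated model (e.g., 80 % confidence and 80 % accuracy) yields an ECE of 0; larger values mean confidence and accuracy diverge.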
Practical Implications
- Human‑in‑the‑loop systems: Deploy UMPIRE to flag high‑uncertainty queries for manual review, reducing costly mistakes in customer‑support bots, medical image analysis, or content moderation.
- Model cascade orchestration: Use the score to decide when to forward a request to a larger, more expensive model (e.g., GPT‑4V) only when the smaller MLLM is uncertain, saving compute and latency.
- Safety & compliance: In regulated domains (finance, healthcare), uncertainty estimates can be logged for audit trails, satisfying compliance requirements for AI explainability.
- Active learning: UMPIRE can identify the most ambiguous samples for labeling, accelerating data collection for fine‑tuning multimodal models.
- Generative pipelines: For image/audio generation, the metric can trigger re‑sampling or post‑processing when the model’s confidence is low, improving overall quality without manual intervention.
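The cascade-orchestration idea above reduces to a simple threshold rule. In this hypothetical sketch, `route`, `score_fn`, and the threshold value are illustrative names, and the models are stand-in callables; any uncertainty estimator (such as an UMPIRE-style score) could serve as `score_fn`:

```python
def route(query, small_model, large_model, score_fn, threshold=0.5):
    """Answer with the cheap model; escalate only when it is uncertain.

    score_fn(query, answer) -> float: higher means more uncertain.
    """
    answer = small_model(query)
    if score_fn(query, answer) > threshold:
        return large_model(query)  # escalate high-uncertainty cases
    return answer                  # trust the cheap answer otherwise
```

The same pattern covers the human-in-the-loop case: replace `large_model` with a function that queues the request for manual review.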
Limitations & Future Work
- Sampling overhead: Although training‑free, UMPIRE still requires multiple forward passes; extremely latency‑sensitive applications may need further optimization.
- Dependence on internal confidence: In cases where the model’s logits are poorly calibrated, the incoherence adjustment may be less reliable.
- Semantic space alignment: The method assumes a shared embedding space across modalities; mismatches could affect volume estimation for exotic modalities (e.g., 3‑D point clouds).
- Future directions suggested by the authors include:
  - Adaptive sampling strategies to reduce compute.
  - Tighter theoretical bounds linking semantic volume to Bayesian posterior uncertainty.
  - Extending UMPIRE to handle streaming or interactive multimodal dialogs.
Bottom line: UMPIRE offers a practical, modality‑agnostic way to quantify uncertainty in today’s powerful multimodal LLMs, giving developers a concrete tool to make AI systems safer, more cost‑effective, and better aligned with real‑world expectations.
Authors
- Gregory Kang Ruey Lau
- Hieu Dao
- Nicole Kan Hui Lin
- Bryan Kian Hsiang Low
Paper Information
- arXiv ID: 2602.24195v1
- Categories: cs.AI, cs.CL, cs.CV, cs.LG
- Published: February 27, 2026