[Paper] Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions

Published: February 24, 2026 at 01:05 PM EST
5 min read

Source: arXiv - 2602.21160v1

Overview

The paper introduces a new way to break down epistemic uncertainty in deep‑learning classifiers. Instead of summarising a model’s ignorance with a single scalar (mutual information, MI), the authors propose a per‑class uncertainty vector that tells you which classes the model is unsure about. This finer‑grained view is especially valuable for safety‑critical applications where mistakes on certain classes (e.g., “cancer” vs. “benign”) carry very different costs.

Key Contributions

  • Per‑class decomposition of MI: Derives a closed‑form vector
    $$C_k(x)=\frac{\sigma_k^{2}}{2\mu_k}$$
    that approximates the contribution of each class $k$ to the overall epistemic uncertainty.
  • Boundary‑aware weighting: The $1/\mu_k$ factor corrects the tendency of traditional variance‑based metrics to under‑represent rare or low‑probability classes.
  • Skewness diagnostic: Provides a cheap check to flag inputs where the Taylor approximation (used to derive $C_k$) breaks down.
  • Axiomatic analysis: Shows that the per‑class scores satisfy desirable properties such as non‑negativity, additivity ($\sum_k C_k \approx \mathrm{MI}$), and invariance to label permutations.
  • Empirical validation on three fronts:
    1. Selective prediction for diabetic retinopathy (DR) – improves risk‑reduction over standard MI and variance baselines.
    2. Out‑of‑distribution (OOD) detection on clinical and natural‑image benchmarks – achieves the highest AUROC and reveals asymmetric distribution shifts invisible to scalar MI.
    3. Label‑noise robustness – per‑class MI is less sensitive to injected aleatoric noise under end‑to‑end Bayesian training.
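The closed form above follows from a second‑order Taylor expansion of the predictive entropy around the posterior‑mean probabilities. A sketch of the step, in the notation used above (with $H$ the Shannon entropy and $\boldsymbol{\mu}$ the mean predictive vector):

```latex
% Entropy H(p) = -\sum_k p_k \log p_k is separable in k, with
% \partial^2 H / \partial p_k^2 = -1/p_k. Expanding around \mu:
\mathbb{E}\!\left[H(\mathbf{p})\right] \;\approx\;
  H(\boldsymbol{\mu}) \;-\; \sum_k \frac{\sigma_k^{2}}{2\mu_k},
\qquad\text{so}\qquad
\mathrm{MI} \;=\; H(\boldsymbol{\mu}) - \mathbb{E}\!\left[H(\mathbf{p})\right]
  \;\approx\; \sum_k \frac{\sigma_k^{2}}{2\mu_k} \;=\; \sum_k C_k(x).
```

This also makes the additivity property plausible: each $C_k$ is the class‑$k$ share of the second‑order approximation to the mutual information.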

Methodology

  1. Bayesian deep‑learning setup – The model is trained with a posterior over its weights (e.g., via Monte‑Carlo dropout or deep ensembles). For a given input $x$, each posterior sample yields a predictive probability vector $\mathbf{p}^{(s)}$.
  2. Compute class‑wise moments:
    • Mean probability for class $k$: $\mu_k = \mathbb{E}[p_k]$ (average over posterior samples).
    • Variance for class $k$: $\sigma_k^2 = \operatorname{Var}[p_k]$.
  3. Taylor‑expand the predictive entropy around the mean to obtain an approximation of the mutual information (MI) between model parameters and predictions. The second‑order term yields the per‑class contribution:
    $$C_k(x) \approx \frac{\sigma_k^2}{2\mu_k}.$$
    Summing across classes recovers the original MI up to higher‑order terms.
  4. Skewness check – Compute the third central moment of the class probabilities; large skewness indicates the second‑order approximation may be unreliable, prompting a fallback to the full MI.
  5. Evaluation pipelines – The authors plug the per‑class scores into existing decision‑making frameworks (selective‑prediction thresholds, OOD detectors, and noise‑sensitivity studies) and compare against baselines that use scalar MI or simple variance.
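The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the function name and the skewness threshold are assumptions, and the exact MI is computed in the standard way as $H(\boldsymbol{\mu}) - \mathbb{E}[H(\mathbf{p})]$:

```python
import numpy as np

def per_class_uncertainty(probs, skew_threshold=1.0, eps=1e-12):
    """Decompose epistemic uncertainty into per-class contributions.

    probs: array of shape (S, K) -- S posterior samples of the predictive
    probability vector over K classes (e.g., from MC dropout or an ensemble).
    Returns (C, exact_mi, use_fallback).
    """
    mu = probs.mean(axis=0)                     # mu_k
    var = probs.var(axis=0)                     # sigma_k^2
    C = var / (2.0 * np.maximum(mu, eps))       # C_k ~ sigma_k^2 / (2 mu_k)

    # Exact mutual information: H(mean prediction) - mean(entropy of samples)
    H_mean = -np.sum(mu * np.log(np.maximum(mu, eps)))
    mean_H = -np.mean(np.sum(probs * np.log(np.maximum(probs, eps)), axis=1))
    exact_mi = H_mean - mean_H

    # Skewness diagnostic: large third central moments flag inputs where the
    # second-order Taylor approximation may be unreliable.
    third = np.mean((probs - mu) ** 3, axis=0)
    skew = third / np.maximum(var, eps) ** 1.5
    use_fallback = bool(np.any(np.abs(skew) > skew_threshold))

    return C, exact_mi, use_fallback
```

In a deployment you would call this once per input, use the vector `C` for class‑level decisions, and fall back to `exact_mi` whenever `use_fallback` is set.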

Results & Findings

| Task | Metric | Improvement | Notable observation |
|---|---|---|---|
| Selective prediction (DR) | Risk reduction at 90 % coverage | 34.7 % lower risk vs. MI (critical‑class $C_k$) | Targeting the “severe DR” class yields the biggest gains. |
| Selective prediction (DR) | Risk reduction at 90 % coverage | 56.2 % lower risk vs. variance baseline | Variance alone over‑penalises easy classes. |
| OOD detection (clinical + ImageNet‑style) | AUROC (overall) | Highest among all tested scores (≈ 0.96) | Per‑class sum $\sum_k C_k$ outperforms MI, variance, and entropy. |
| OOD detection (clinical + ImageNet‑style) | Per‑class view | Reveals that OOD shift is dominated by a subset of classes (e.g., “malignant” in medical images) | Enables class‑specific alerts. |
| Label‑noise robustness | Sensitivity to injected noise (ΔAUROC) | Smaller drop for $\sum_k C_k$ under end‑to‑end Bayesian training | Both MI and per‑class MI degrade when the posterior is approximated via transfer learning, highlighting the importance of a good posterior. |

Across all experiments, the quality of the posterior approximation (how well the Bayesian inference captures weight uncertainty) proved as influential as the choice of uncertainty metric.

Practical Implications

  • Risk‑aware deployment: Developers can now set class‑specific confidence thresholds (e.g., stricter for “fire” vs. “smoke” in video surveillance) rather than a one‑size‑fits‑all cutoff.
  • Explainable alerts: When an OOD sample is flagged, the per‑class vector tells engineers which categories the model is confused about, simplifying root‑cause analysis.
  • Selective inference pipelines: In medical imaging or autonomous driving, you can automatically defer only the high‑risk predictions to a human reviewer, saving bandwidth while preserving safety.
  • Model debugging & data collection: High per‑class uncertainty on a rare class suggests the need for more labeled data or targeted augmentation for that class.
  • Compatibility: The method works with any Bayesian approximation that yields multiple predictive samples (dropout, ensembles, SWAG, etc.) and can be added as a post‑processing step without retraining the model.
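The class‑specific thresholding idea can be made concrete with a small sketch. This deferral policy is illustrative, not from the paper: `defer_mask` and the threshold values are hypothetical, and the per‑class scores are assumed to have been computed already (one $C_k$ vector per input):

```python
import numpy as np

def defer_mask(C, thresholds):
    """Flag inputs for human review using class-specific cutoffs.

    C: (N, K) per-class uncertainty scores for N inputs.
    thresholds: (K,) per-class cutoffs -- stricter (smaller) for
    high-cost classes such as "severe DR".
    Returns a boolean mask of inputs to defer.
    """
    # Defer if ANY class exceeds its own threshold.
    return np.any(C > thresholds[None, :], axis=1)

# Example: class 0 is low-cost (loose cutoff), class 1 is high-cost (tight).
scores = np.array([[0.01, 0.20],
                   [0.01, 0.01]])
mask = defer_mask(scores, thresholds=np.array([0.05, 0.10]))
# First input exceeds the class-1 cutoff and is deferred; second is not.
```

Because the policy acts on the vector rather than a scalar, only inputs that are uncertain about *costly* classes get deferred, which is the bandwidth saving described above.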

Limitations & Future Work

  • Approximation accuracy: The per‑class scores rely on a second‑order Taylor expansion; extreme skewness in the predictive distribution can make the approximation unreliable, requiring the fallback skewness diagnostic.
  • Posterior dependence: The benefits diminish when the posterior is poorly approximated (e.g., naive transfer learning), indicating that the method is not a silver bullet for all Bayesian setups.
  • Scalability to ultra‑large vocabularies: While computationally cheap per sample, storing and processing a per‑class vector for thousands of classes (e.g., language models) may be memory‑intensive.
  • Future directions suggested by the authors include: extending the decomposition to hierarchical label spaces, integrating the per‑class uncertainty into loss functions for active learning, and exploring higher‑order expansions to tighten the MI approximation.

Authors

  • Mame Diarra Toure
  • David A. Stephens

Paper Information

  • arXiv ID: 2602.21160v1
  • Categories: stat.ML, cs.LG, stat.AP, stat.ME
  • Published: February 24, 2026
