[Paper] Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions

Published: February 24, 2026 at 01:05 PM EST
5 min read

Source: arXiv - 2602.21160v1

Overview

The paper introduces a new way to break down epistemic uncertainty in deep‑learning classifiers. Instead of summarising a model’s ignorance with a single scalar (mutual information, MI), the authors propose a per‑class uncertainty vector that tells you which classes the model is unsure about. This finer‑grained view is especially valuable for safety‑critical applications where mistakes on certain classes (e.g., “cancer” vs. “benign”) carry very different costs.

Key Contributions

  • Per‑class decomposition of MI: Derives a closed‑form vector
    $$C_k(x)=\frac{\sigma_k^{2}}{2\mu_k}$$
    that approximates the contribution of each class $k$ to the overall epistemic uncertainty.
  • Boundary‑aware weighting: The $1/\mu_k$ factor corrects the tendency of traditional variance‑based metrics to under‑represent rare or low‑probability classes.
  • Skewness diagnostic: Provides a cheap check to flag inputs where the Taylor approximation (used to derive $C_k$) breaks down.
  • Axiomatic analysis: Shows that the per‑class scores satisfy desirable properties such as non‑negativity, additivity ($\sum_k C_k \approx \mathrm{MI}$), and invariance to label permutations.
  • Empirical validation on three fronts:
    1. Selective prediction for diabetic retinopathy (DR) – improves risk‑reduction over standard MI and variance baselines.
    2. Out‑of‑distribution (OOD) detection on clinical and natural‑image benchmarks – achieves the highest AUROC and reveals asymmetric distribution shifts invisible to scalar MI.
    3. Label‑noise robustness – per‑class MI is less sensitive to injected aleatoric noise under end‑to‑end Bayesian training.
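The closed form above follows from a second‑order Taylor expansion of the predictive entropy around the posterior‑mean probabilities. A sketch of the step, in the notation used above (with $H$ the Shannon entropy and $\boldsymbol{\mu}$ the mean predictive vector):

```latex
% Entropy H(p) = -\sum_k p_k \log p_k is separable in k, with
% \partial^2 H / \partial p_k^2 = -1/p_k. Expanding around \mu:
\mathbb{E}\!\left[H(\mathbf{p})\right] \;\approx\;
  H(\boldsymbol{\mu}) \;-\; \sum_k \frac{\sigma_k^{2}}{2\mu_k},
\qquad\text{so}\qquad
\mathrm{MI} \;=\; H(\boldsymbol{\mu}) - \mathbb{E}\!\left[H(\mathbf{p})\right]
  \;\approx\; \sum_k \frac{\sigma_k^{2}}{2\mu_k} \;=\; \sum_k C_k(x).
```

This also makes the additivity property plausible: each $C_k$ is the class‑$k$ share of the second‑order approximation to the mutual information.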

Methodology

  1. Bayesian deep‑learning setup – The model is trained with a posterior over its weights (e.g., via Monte‑Carlo dropout or deep ensembles). For a given input $x$, each posterior sample yields a predictive probability vector $\mathbf{p}^{(s)}$.
  2. Compute class‑wise moments:
    • Mean probability for class $k$: $\mu_k = \mathbb{E}[p_k]$ (average over posterior samples).
    • Variance for class $k$: $\sigma_k^2 = \operatorname{Var}[p_k]$.
  3. Taylor‑expand the predictive entropy around the mean to obtain an approximation of the mutual information (MI) between model parameters and predictions. The second‑order term yields the per‑class contribution:
    $$C_k(x) \approx \frac{\sigma_k^2}{2\mu_k}.$$
    Summing across classes recovers the original MI up to higher‑order terms.
  4. Skewness check – Compute the third central moment of the class probabilities; large skewness indicates the second‑order approximation may be unreliable, prompting a fallback to the full MI.
  5. Evaluation pipelines – The authors plug the per‑class scores into existing decision‑making frameworks (selective‑prediction thresholds, OOD detectors, and noise‑sensitivity studies) and compare against baselines that use scalar MI or simple variance.
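The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the function name and the skewness threshold are assumptions, and the exact MI is computed in the standard way as $H(\boldsymbol{\mu}) - \mathbb{E}[H(\mathbf{p})]$:

```python
import numpy as np

def per_class_uncertainty(probs, skew_threshold=1.0, eps=1e-12):
    """Decompose epistemic uncertainty into per-class contributions.

    probs: array of shape (S, K) -- S posterior samples of the predictive
    probability vector over K classes (e.g., from MC dropout or an ensemble).
    Returns (C, exact_mi, use_fallback).
    """
    mu = probs.mean(axis=0)                     # mu_k
    var = probs.var(axis=0)                     # sigma_k^2
    C = var / (2.0 * np.maximum(mu, eps))       # C_k ~ sigma_k^2 / (2 mu_k)

    # Exact mutual information: H(mean prediction) - mean(entropy of samples)
    H_mean = -np.sum(mu * np.log(np.maximum(mu, eps)))
    mean_H = -np.mean(np.sum(probs * np.log(np.maximum(probs, eps)), axis=1))
    exact_mi = H_mean - mean_H

    # Skewness diagnostic: large third central moments flag inputs where the
    # second-order Taylor approximation may be unreliable.
    third = np.mean((probs - mu) ** 3, axis=0)
    skew = third / np.maximum(var, eps) ** 1.5
    use_fallback = bool(np.any(np.abs(skew) > skew_threshold))

    return C, exact_mi, use_fallback
```

In a deployment you would call this once per input, use the vector `C` for class‑level decisions, and fall back to `exact_mi` whenever `use_fallback` is set.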

Results & Findings

| Task | Metric | Improvement | Notable observation |
|---|---|---|---|
| Selective prediction (DR) | Risk reduction at 90 % coverage | 34.7 % lower risk vs. MI (critical‑class $C_k$) | Targeting the “severe DR” class yields the biggest gains. |
| Selective prediction (DR) | Risk reduction at 90 % coverage | 56.2 % lower risk vs. variance baseline | Variance alone over‑penalises easy classes. |
| OOD detection (clinical + ImageNet‑style) | AUROC (overall) | Highest among all tested scores (≈ 0.96) | Per‑class sum $\sum_k C_k$ outperforms MI, variance, and entropy. |
| OOD detection (clinical + ImageNet‑style) | Per‑class view | Reveals that OOD shift is dominated by a subset of classes (e.g., “malignant” in medical images) | Enables class‑specific alerts. |
| Label‑noise robustness | Sensitivity to injected noise (ΔAUROC) | Smaller drop for $\sum_k C_k$ under end‑to‑end Bayesian training | Both MI and per‑class MI degrade when the posterior is approximated via transfer learning, highlighting the importance of a good posterior. |

Across all experiments, the quality of the posterior approximation (how well the Bayesian inference captures weight uncertainty) proved as influential as the choice of uncertainty metric.

Practical Implications

  • Risk‑aware deployment: Developers can now set class‑specific confidence thresholds (e.g., stricter for “fire” vs. “smoke” in video surveillance) rather than a one‑size‑fits‑all cutoff.
  • Explainable alerts: When an OOD sample is flagged, the per‑class vector tells engineers which categories the model is confused about, simplifying root‑cause analysis.
  • Selective inference pipelines: In medical imaging or autonomous driving, you can automatically defer only the high‑risk predictions to a human reviewer, saving bandwidth while preserving safety.
  • Model debugging & data collection: High per‑class uncertainty on a rare class suggests the need for more labeled data or targeted augmentation for that class.
  • Compatibility: The method works with any Bayesian approximation that yields multiple predictive samples (dropout, ensembles, SWAG, etc.) and can be added as a post‑processing step without retraining the model.
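The class‑specific thresholding idea can be made concrete with a small sketch. This deferral policy is illustrative, not from the paper: `defer_mask` and the threshold values are hypothetical, and the per‑class scores are assumed to have been computed already (one $C_k$ vector per input):

```python
import numpy as np

def defer_mask(C, thresholds):
    """Flag inputs for human review using class-specific cutoffs.

    C: (N, K) per-class uncertainty scores for N inputs.
    thresholds: (K,) per-class cutoffs -- stricter (smaller) for
    high-cost classes such as "severe DR".
    Returns a boolean mask of inputs to defer.
    """
    # Defer if ANY class exceeds its own threshold.
    return np.any(C > thresholds[None, :], axis=1)

# Example: class 0 is low-cost (loose cutoff), class 1 is high-cost (tight).
scores = np.array([[0.01, 0.20],
                   [0.01, 0.01]])
mask = defer_mask(scores, thresholds=np.array([0.05, 0.10]))
# First input exceeds the class-1 cutoff and is deferred; second is not.
```

Because the policy acts on the vector rather than a scalar, only inputs that are uncertain about *costly* classes get deferred, which is the bandwidth saving described above.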

Limitations & Future Work

  • Approximation accuracy: The per‑class scores rely on a second‑order Taylor expansion; extreme skewness in the predictive distribution can make the approximation unreliable, requiring the fallback skewness diagnostic.
  • Posterior dependence: The benefits diminish when the posterior is poorly approximated (e.g., naive transfer learning), indicating that the method is not a silver bullet for all Bayesian setups.
  • Scalability to ultra‑large vocabularies: While computationally cheap per sample, storing and processing a per‑class vector for thousands of classes (e.g., language models) may be memory‑intensive.
  • Future directions suggested by the authors include: extending the decomposition to hierarchical label spaces, integrating the per‑class uncertainty into loss functions for active learning, and exploring higher‑order expansions to tighten the MI approximation.

Authors

  • Mame Diarra Toure
  • David A. Stephens

Paper Information

  • arXiv ID: 2602.21160v1
  • Categories: stat.ML, cs.LG, stat.AP, stat.ME
  • Published: February 24, 2026
