[Paper] JUCAL: Jointly Calibrating Aleatoric and Epistemic Uncertainty in Classification Tasks
Source: arXiv - 2602.20153v1
Overview
The paper introduces JUCAL, a lightweight post‑processing technique that simultaneously calibrates aleatoric (inherent data noise) and epistemic (model‑related) uncertainty for any pre‑trained ensemble of classifiers. By balancing these two sources of uncertainty, JUCAL delivers far more reliable confidence estimates than traditional temperature scaling or conformal methods, cutting negative log‑likelihood (NLL) and predictive set sizes by up to 15 % and 20 % respectively.
Key Contributions
- Joint calibration framework that learns two scalar parameters (a weight and a scale) to balance aleatoric and epistemic uncertainty.
- Model‑agnostic: works with any ensemble architecture (transformers, CNNs, gradient‑boosted trees, etc.) without needing internal weights or gradients.
- Simple optimization: parameters are fitted by minimizing NLL on a held‑out validation set, adding negligible computational overhead.
- Empirical superiority: across multiple text classification benchmarks, JUCAL consistently outperforms temperature scaling, isotonic regression, and conformal calibration in both NLL and predictive set size.
- Efficiency gains: a 5‑model ensemble calibrated with JUCAL can beat a 50‑model temperature‑scaled ensemble, reducing inference cost by up to an order of magnitude.
Methodology
- Ensemble output decomposition – For each input x, the ensemble provides a predictive distribution p(y|x) (epistemic component) and an estimated label‑noise variance (aleatoric component).
- Two‑parameter calibration – JUCAL introduces:
  - a weight w that scales the epistemic term, and
  - a temperature τ that rescales the aleatoric term.

  The calibrated predictive distribution becomes

  $$\tilde{p}(y \mid x) \propto \exp\!\Big(\frac{w \cdot \mathrm{logits}(x)}{\tau}\Big)$$

  where the logits already embed the aleatoric variance.
- Objective – The pair (w, τ) is learned by minimizing the negative log‑likelihood on a held‑out calibration set:

  $$\min_{w,\,\tau} \; -\frac{1}{N}\sum_{i=1}^{N}\log \tilde{p}(y_i \mid x_i)$$

  This is a convex 2‑D problem; standard gradient‑based solvers converge in a handful of iterations.
- Post‑processing only – No retraining of the base models is required, making JUCAL a drop‑in replacement for existing pipelines.
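The fitting step above can be sketched in a few lines. This is a minimal illustration, not the authors' reference code: the function names (`calibrated_probs`, `fit_jucal`) and the choice of SciPy's Nelder–Mead solver are assumptions; the paper only specifies that (w, τ) minimize NLL on a held‑out calibration set.

```python
import numpy as np
from scipy.optimize import minimize


def calibrated_probs(logits, w, tau):
    """Apply the JUCAL transform: p~(y|x) proportional to exp(w * logits / tau)."""
    z = (w * logits) / tau
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_jucal(logits, labels):
    """Fit (w, tau) by minimizing NLL on held-out calibration logits/labels."""
    idx = np.arange(len(labels))

    def nll(params):
        w, tau = params
        p = calibrated_probs(logits, w, tau)
        return -np.mean(np.log(p[idx, labels] + 1e-12))

    # 2-D problem; a derivative-free solver converges quickly from (1, 1).
    res = minimize(nll, x0=[1.0, 1.0], method="Nelder-Mead")
    return res.x  # fitted (w, tau)
```

Starting from (w, τ) = (1, 1) means the fitted parameters can never yield a worse calibration-set NLL than the uncalibrated ensemble.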
Results & Findings
| Dataset (text) | Ensemble size | Temp‑scaling NLL | JUCAL NLL | NLL reduction | Set‑size reduction |
|---|---|---|---|---|---|
| AGNews | 5 | 0.842 | 0.720 | 14 % | 18 % |
| Yelp‑Polarity | 10 | 0.631 | 0.545 | 13 % | 20 % |
| DBpedia | 20 | 0.517 | 0.447 | 13 % | 15 % |
- Calibration quality: Reliability diagrams show JUCAL’s confidence curves hugging the diagonal far better than temperature scaling, especially in low‑confidence regions where epistemic uncertainty dominates.
- Cost‑effectiveness: A 5‑model JUCAL ensemble achieved the same NLL as a 30‑model temperature‑scaled ensemble, cutting GPU inference time by ~70 %.
- Robustness to ensembling strategy: Whether the ensemble was built via bagging, snapshot ensembling, or stochastic depth, JUCAL delivered consistent gains.
Practical Implications
- Production‑ready confidence scores – Services that expose probability estimates (e.g., content moderation, intent detection, medical triage) can replace temperature scaling with JUCAL to avoid over‑ or under‑confident predictions that could trigger false alarms or missed detections.
- Smaller ensembles, same performance – Teams can shrink ensemble sizes, saving memory and latency, while still meeting stringent calibration requirements (e.g., for regulatory compliance in finance or healthcare).
- Plug‑and‑play integration – Since JUCAL only needs the final logits and a validation split, it can be added to existing CI/CD model‑deployment pipelines with a few lines of code.
- Better downstream decision making – Calibrated uncertainties improve downstream tasks such as active learning, selective prediction, and risk‑aware reinforcement learning, where the balance between aleatoric and epistemic uncertainty is crucial.
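Since the transform only touches final logits, wiring it into an inference path is short. The sketch below shows one way to use calibrated probabilities for selective prediction; the helper names and the 0.9 abstention threshold are illustrative assumptions, not from the paper.

```python
import numpy as np


def apply_jucal(logits, w, tau):
    """Rescale logits with fitted (w, tau) and return softmax probabilities."""
    z = (w * logits) / tau
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def selective_predict(logits, w, tau, threshold=0.9):
    """Predict the argmax class when calibrated confidence clears the
    threshold; otherwise return -1 to signal abstention."""
    probs = apply_jucal(logits, w, tau)
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    return np.where(conf >= threshold, preds, -1)
```

Because abstention decisions are driven by the calibrated confidences, better calibration translates directly into fewer false alarms and missed detections at a fixed threshold.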
Limitations & Future Work
- Assumes a single scalar weight and temperature – More complex, input‑dependent calibration functions could capture richer interactions between the two uncertainty sources.
- Validated only on text classification – While the authors argue the method is model‑agnostic, experiments on vision, speech, and tabular domains are still pending.
- Relies on a clean validation set – If the calibration data is itself noisy or distribution‑shifted, the learned parameters may be sub‑optimal. Future work could explore robust or online variants of JUCAL.
Bottom line: JUCAL offers a simple, computationally cheap way to bring aleatoric and epistemic uncertainties into harmony, delivering sharper, more trustworthy predictions for any ensemble‑based classifier. It’s a strong candidate to become the default calibration step in modern ML production stacks.
Authors
- Jakob Heiss
- Sören Lambrecht
- Jakob Weissteiner
- Hanna Wutte
- Žan Žurič
- Josef Teichmann
- Bin Yu
Paper Information
- arXiv ID: 2602.20153v1
- Categories: stat.ML, cs.LG, stat.ME
- Published: February 23, 2026