[Paper] JUCAL: Jointly Calibrating Aleatoric and Epistemic Uncertainty in Classification Tasks
Source: arXiv - 2602.20153v1
Overview
The paper introduces JUCAL, a lightweight post‑processing technique that simultaneously calibrates aleatoric (inherent data noise) and epistemic (model‑related) uncertainty for any pre‑trained ensemble of classifiers. By balancing these two sources of uncertainty, JUCAL delivers far more reliable confidence estimates than traditional temperature scaling or conformal methods, cutting negative log‑likelihood (NLL) and predictive set sizes by up to 15 % and 20 % respectively.
Key Contributions
- Joint calibration framework that learns two scalar parameters (a weight and a scale) to balance aleatoric and epistemic uncertainty.
- Model‑agnostic: works with any ensemble architecture (transformers, CNNs, gradient‑boosted trees, etc.) without needing internal weights or gradients.
- Simple optimization: parameters are fitted by minimizing NLL on a held‑out validation set, adding negligible computational overhead.
- Empirical superiority: across multiple text classification benchmarks, JUCAL consistently outperforms temperature scaling, isotonic regression, and conformal calibration in both NLL and predictive set size.
- Efficiency gains: a 5‑model ensemble calibrated with JUCAL can beat a 50‑model temperature‑scaled ensemble, reducing inference cost by up to an order of magnitude.
Methodology
- Ensemble output decomposition – For each input x, the ensemble provides a predictive distribution p(y|x) (epistemic component) and an estimated label‑noise variance (aleatoric component).
- Two‑parameter calibration – JUCAL introduces:
  - a weight w that scales the epistemic term, and
  - a temperature τ that rescales the aleatoric term.

  The calibrated predictive distribution becomes

  $$\tilde{p}(y \mid x) \propto \exp\!\Big(\frac{w \cdot \mathrm{logits}(x)}{\tau}\Big)$$

  where the logits already embed the aleatoric variance.
- Objective – The pair (w, τ) is learned by minimizing the negative log‑likelihood on a held‑out calibration set:

  $$\min_{w,\,\tau} \; -\frac{1}{N}\sum_{i=1}^{N}\log \tilde{p}(y_i \mid x_i)$$

  This is a convex 2‑D problem; standard gradient‑based solvers converge in a handful of iterations.
- Post‑processing only – No retraining of the base models is required, making JUCAL a drop‑in replacement for existing pipelines.
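The fitting step above can be sketched in a few lines. This is a minimal illustration, not the authors' reference code: the function names (`calibrated_probs`, `fit_jucal`) and the choice of SciPy's Nelder–Mead solver are assumptions; the paper only specifies that (w, τ) minimize NLL on a held‑out calibration set.

```python
import numpy as np
from scipy.optimize import minimize


def calibrated_probs(logits, w, tau):
    """Apply the JUCAL transform: p~(y|x) proportional to exp(w * logits / tau)."""
    z = (w * logits) / tau
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_jucal(logits, labels):
    """Fit (w, tau) by minimizing NLL on held-out calibration logits/labels."""
    idx = np.arange(len(labels))

    def nll(params):
        w, tau = params
        p = calibrated_probs(logits, w, tau)
        return -np.mean(np.log(p[idx, labels] + 1e-12))

    # 2-D problem; a derivative-free solver converges quickly from (1, 1).
    res = minimize(nll, x0=[1.0, 1.0], method="Nelder-Mead")
    return res.x  # fitted (w, tau)
```

Starting from (w, τ) = (1, 1) means the fitted parameters can never yield a worse calibration-set NLL than the uncalibrated ensemble.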
Results & Findings
| Dataset (text) | Ensemble size | Temp‑scaling NLL | JUCAL NLL | NLL reduction | Set‑size reduction |
|---|---|---|---|---|---|
| AGNews | 5 | 0.842 | 0.720 | 14 % | 18 % |
| Yelp‑Polarity | 10 | 0.631 | 0.545 | 13 % | 20 % |
| DBpedia | 20 | 0.517 | 0.447 | 13 % | 15 % |
- Calibration quality: Reliability diagrams show JUCAL’s confidence curves hugging the diagonal far better than temperature scaling, especially in low‑confidence regions where epistemic uncertainty dominates.
- Cost‑effectiveness: A 5‑model JUCAL ensemble achieved the same NLL as a 30‑model temperature‑scaled ensemble, cutting GPU inference time by ~70 %.
- Robustness to ensembling strategy: Whether the ensemble was built via bagging, snapshot ensembling, or stochastic depth, JUCAL delivered consistent gains.
Practical Implications
- Production‑ready confidence scores – Services that expose probability estimates (e.g., content moderation, intent detection, medical triage) can replace temperature scaling with JUCAL to avoid over‑ or under‑confident predictions that could trigger false alarms or missed detections.
- Smaller ensembles, same performance – Teams can shrink ensemble sizes, saving memory and latency, while still meeting stringent calibration requirements (e.g., for regulatory compliance in finance or healthcare).
- Plug‑and‑play integration – Since JUCAL only needs the final logits and a validation split, it can be added to existing CI/CD model‑deployment pipelines with a few lines of code.
- Better downstream decision making – Calibrated uncertainties improve downstream tasks such as active learning, selective prediction, and risk‑aware reinforcement learning, where the balance between aleatoric and epistemic uncertainty is crucial.
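Since the transform only touches final logits, wiring it into an inference path is short. The sketch below shows one way to use calibrated probabilities for selective prediction; the helper names and the 0.9 abstention threshold are illustrative assumptions, not from the paper.

```python
import numpy as np


def apply_jucal(logits, w, tau):
    """Rescale logits with fitted (w, tau) and return softmax probabilities."""
    z = (w * logits) / tau
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def selective_predict(logits, w, tau, threshold=0.9):
    """Predict the argmax class when calibrated confidence clears the
    threshold; otherwise return -1 to signal abstention."""
    probs = apply_jucal(logits, w, tau)
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    return np.where(conf >= threshold, preds, -1)
```

Because abstention decisions are driven by the calibrated confidences, better calibration translates directly into fewer false alarms and missed detections at a fixed threshold.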
Limitations & Future Work
- Assumes a single scalar weight and temperature – More complex, input‑dependent calibration functions could capture richer interactions between the two uncertainty sources.
- Validated only on text classification – While the authors argue the method is model‑agnostic, experiments on vision, speech, and tabular domains are still pending.
- Relies on a clean validation set – If the calibration data is itself noisy or distribution‑shifted, the learned parameters may be sub‑optimal. Future work could explore robust or online variants of JUCAL.
Bottom line: JUCAL offers a simple, computationally cheap way to bring aleatoric and epistemic uncertainties into harmony, delivering sharper, more trustworthy predictions for any ensemble‑based classifier. It’s a strong candidate to become the default calibration step in modern ML production stacks.
Authors
- Jakob Heiss
- Sören Lambrecht
- Jakob Weissteiner
- Hanna Wutte
- Žan Žurič
- Josef Teichmann
- Bin Yu
Paper Information
- arXiv ID: 2602.20153v1
- Categories: stat.ML, cs.LG, stat.ME
- Published: February 23, 2026