[Paper] Conditional Coverage Diagnostics for Conformal Prediction

Published: December 12, 2025
Source: arXiv - 2512.11779v1

Overview

The paper introduces Excess Risk of the Target coverage (ERT), a new family of diagnostics that turn the problem of checking conditional coverage in conformal prediction into a standard classification task. By leveraging modern classifiers, ERT provides a statistically powerful, sample‑efficient way to spot where predictive sets systematically under‑ or over‑cover, something that existing tools struggle to do.

Key Contributions

  • Reformulation of conditional coverage testing as a binary classification problem, enabling the use of any off‑the‑shelf classifier.
  • Definition of the ERT metric, which quantifies the gap between the classifier’s risk and the nominal coverage target, yielding conservative estimates of common miscoverage measures (e.g., L₁/L₂ distance).
  • Separation of over‑coverage vs. under‑coverage and handling of non‑constant (heterogeneous) target coverages within a single framework.
  • Empirical demonstration that modern, high‑capacity classifiers (e.g., gradient‑boosted trees, deep nets) achieve far higher statistical power than the classic CovGap metric based on simple linear classifiers.
  • Comprehensive benchmark of several conformal prediction methods (split‑conformal, cross‑conformal, jackknife+, etc.) using the new diagnostics.
  • Open‑source release of a Python package that implements ERT alongside legacy conditional coverage metrics, facilitating immediate adoption.

Methodology

  1. Problem framing – For a given test point \(x\) and its conformal prediction set \(\mathcal{C}(x)\), conditional coverage holds if
    \[ \Pr\bigl(Y \in \mathcal{C}(x) \mid X = x\bigr) \geq 1-\alpha . \]
    The authors observe that a violation exists exactly when some classifier can predict “miscovered” vs. “covered” from \(x\) better (under a proper loss) than the constant predictor that assigns every point the nominal coverage \(1-\alpha\).

  2. Classification reduction – Construct the binary label \(Z = \mathbf{1}\{Y \notin \mathcal{C}(X)\}\) and train any probabilistic classifier \(g_\theta\) to predict \(Z\) from \(X\).

  3. Proper loss & excess risk – Using a proper loss (e.g., log‑loss or squared loss), compute the empirical risk \(R(g_\theta)\). The ERT is defined as
    \[ \mathrm{ERT} = R(g_\theta) - (1-\alpha) . \]
    Positive ERT indicates systematic under‑coverage; negative values signal over‑coverage. By selecting different losses, the metric can approximate the L₁/L₂ miscoverage distances.

  4. Statistical testing – A permutation‑based or asymptotic test evaluates whether the observed ERT is significantly greater than zero, providing a diagnostic rather than a mere point estimate.

  5. Implementation – The authors plug in a suite of classifiers (logistic regression, random forests, XGBoost, neural nets) and compare their power to detect conditional coverage violations; a minimal code sketch of this workflow follows the list.
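
The listing below is a minimal sketch of the classification reduction (steps 2–4), assuming NumPy and scikit‑learn. The `covered` array, the gradient‑boosted classifier, the log‑loss, the constant baseline fixed at the nominal miscoverage rate, and the label‑permutation test are illustrative choices; the paper's exact ERT formula (stated above) and test may differ in detail.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

def coverage_diagnostic(X, covered, alpha=0.1, n_perm=500, seed=0):
    """Illustrative reduction of conditional-coverage checking to classification.

    X       : (n, d) array of features for held-out test points.
    covered : (n,) boolean array, True if y_i fell inside the conformal set C(x_i).
    """
    rng = np.random.default_rng(seed)
    z = (~np.asarray(covered)).astype(int)      # Z = 1{Y not in C(X)}
    n = len(z)

    # Fit the diagnostic classifier on one half, evaluate its risk on the other.
    idx = rng.permutation(n)
    fit, ev = idx[: n // 2], idx[n // 2:]
    clf = GradientBoostingClassifier().fit(X[fit], z[fit])
    p_hat = clf.predict_proba(X[ev])[:, 1]

    # Proper-loss risk of the classifier vs. the constant baseline that always
    # predicts the nominal miscoverage rate alpha.
    risk_clf = log_loss(z[ev], p_hat, labels=[0, 1])
    risk_target = log_loss(z[ev], np.full(len(ev), alpha), labels=[0, 1])
    excess = risk_target - risk_clf             # > 0: coverage appears to depend on x

    # Permutation null: shuffling Z breaks any association with the features,
    # giving the distribution of the excess risk when coverage is homogeneous.
    null = np.empty(n_perm)
    for b in range(n_perm):
        z_perm = rng.permutation(z[ev])
        null[b] = (log_loss(z_perm, np.full(len(ev), alpha), labels=[0, 1])
                   - log_loss(z_perm, p_hat, labels=[0, 1]))
    p_value = (1 + np.sum(null >= excess)) / (1 + n_perm)
    return excess, p_value
```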

Results & Findings

| Experiment | Metric | Classifier | Power to detect violation (α = 0.1) |
| --- | --- | --- | --- |
| Synthetic heteroskedastic regression | ERT (log‑loss) | XGBoost | 0.92 |
| Same setup | CovGap | Linear | 0.48 |
| Real‑world image classification (CIFAR‑10) | ERT (cross‑entropy) | ResNet‑18 | 0.81 |
| Same setup | CovGap | Linear | 0.33 |

  • Higher power: Modern classifiers roughly doubled the detection power compared with CovGap’s linear baseline across the reported experiments.
  • Granular diagnostics: By inspecting the classifier’s calibrated probabilities, the authors could pinpoint regions of feature space where under‑coverage was severe (e.g., rare classes, high‑variance inputs); a sketch of this kind of inspection follows the list.
  • Benchmark insights: Among the conformal methods tested, cross‑conformal and jackknife+ showed the smallest ERT values across most datasets, confirming their superior conditional reliability.
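
A sketch of the inspection referenced in the granular‑diagnostics bullet, assuming the `p_hat` probabilities from the earlier diagnostic sketch plus pandas; the grouping variable and the α‑gap column are illustrative, not part of the paper's tooling.

```python
import numpy as np
import pandas as pd

def miscoverage_by_group(groups, p_hat, alpha=0.1):
    """Average predicted miscoverage per group (e.g., per class label or
    per bin of input variance), compared with the nominal rate alpha.

    groups : (n,) group identifiers for the evaluation points.
    p_hat  : (n,) predicted P(Y not in C(X) | X) from the diagnostic classifier.
    """
    df = pd.DataFrame({"group": np.asarray(groups), "p_miss": np.asarray(p_hat)})
    summary = df.groupby("group")["p_miss"].agg(["mean", "count"])
    summary["gap_vs_alpha"] = summary["mean"] - alpha   # > 0 suggests under-coverage
    return summary.sort_values("gap_vs_alpha", ascending=False)
```

Groups at the top of the returned table (largest positive gap) are natural candidates for the region‑specific recalibration discussed under Practical Implications.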

Practical Implications

  • Debugging predictive pipelines: Developers can now attach an ERT check to any conformal predictor to automatically flag subpopulations where the coverage guarantee fails.
  • Model selection & hyper‑parameter tuning: ERT can be used as a validation metric when choosing the underlying regression/classification model for conformal inference; because it is differentiable with respect to the classifier’s parameters (under a smooth proper loss), it also lends itself to gradient‑based tuning.
  • Regulatory compliance: In high‑stakes domains (healthcare, finance), regulators often demand evidence of local reliability. ERT provides a statistically sound, easy‑to‑explain certificate that can be included in model cards or model‑risk assessments.
  • Adaptive conformal methods: The diagnostic can drive conditional recalibration, e.g., adjusting the non‑conformity score threshold in regions where ERT indicates under‑coverage, leading to tighter yet reliable prediction sets; a group‑wise recalibration sketch follows this list.
  • Tooling: The released Python package (ert-metrics) integrates with scikit‑learn, PyTorch, and TensorFlow, making it straightforward to plug into existing CI pipelines.
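
One simple way to act on the diagnostic, as mentioned in the adaptive‑conformal bullet, is group‑wise (Mondrian‑style) recalibration: recompute the split‑conformal threshold separately inside each flagged region. The sketch below assumes scalar non‑conformity scores and uses the standard finite‑sample quantile correction; it is a generic recipe, not the paper's method.

```python
import numpy as np

def groupwise_thresholds(cal_scores, cal_groups, alpha=0.1):
    """Split-conformal threshold per region/group.

    cal_scores : (n,) non-conformity scores on the calibration set.
    cal_groups : (n,) region labels (e.g., regions flagged by the ERT classifier).
    """
    cal_scores = np.asarray(cal_scores)
    cal_groups = np.asarray(cal_groups)
    thresholds = {}
    for g in np.unique(cal_groups):
        s = np.sort(cal_scores[cal_groups == g])
        n_g = len(s)
        # Finite-sample-corrected quantile level; degenerates to the maximum
        # score when the group is very small.
        level = min(1.0, np.ceil((n_g + 1) * (1 - alpha)) / n_g)
        thresholds[g] = np.quantile(s, level, method="higher")
    return thresholds
```

A new point assigned to group g then receives the set of candidate values whose non‑conformity score falls below that group's threshold, so sets widen where the diagnostic flagged under‑coverage and can tighten elsewhere.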

Limitations & Future Work

  • Sample efficiency still depends on classifier quality: In extremely low‑sample regimes, even powerful classifiers may overfit, leading to overly optimistic (i.e., low) ERT values.
  • Choice of loss influences interpretation: While the authors provide guidance, selecting the “right” proper loss for a given application may require domain expertise.
  • Computational overhead: Training a high‑capacity classifier for each conformal method under evaluation adds runtime, which could be prohibitive for very large datasets.
  • Theoretical guarantees: The paper offers conservative bounds but does not yet prove tightness of the ERT estimate under arbitrary data distributions.

Future directions include: (1) developing sample‑adaptive classifiers that automatically regularize based on the size of the calibration set, (2) extending ERT to multi‑label or structured output spaces, and (3) integrating the metric into end‑to‑end differentiable conformal pipelines for joint optimization of predictive performance and conditional coverage.

Authors

  • Sacha Braun
  • David Holzmüller
  • Michael I. Jordan
  • Francis Bach

Paper Information

  • arXiv ID: 2512.11779v1
  • Categories: stat.ML, cs.AI, cs.LG
  • Published: December 12, 2025