[Paper] Conformal Risk Control for Non-Monotonic Losses
Source: arXiv - 2602.20151v1
Overview
The paper extends conformal risk control, a framework for guaranteeing that a model's loss stays within a user-specified budget, beyond the classic setting where the loss is monotonic in a single threshold. By handling non-monotonic losses and multidimensional control parameters, the authors make risk-controlled prediction usable for a far broader set of real-world tasks, such as selective classification, medical image segmentation, and fairness-aware recidivism scoring.
Key Contributions
- Generalized risk‑control guarantees for arbitrary loss functions, even when the loss does not increase monotonically with the control parameter.
- Multidimensional control: the theory supports several knobs (e.g., coverage, false‑discovery rate, intersection‑over‑union) that can be tuned simultaneously.
- Stability‑aware bounds: the tightness of the guarantee scales with a measurable notion of algorithmic stability, highlighting why well‑behaved (stable) learners are preferable.
- Concrete applications demonstrating the method on:
  - Selective image classification – rejecting uncertain predictions to keep error below a target.
  - FDR and IoU control for tumor segmentation – ensuring a bounded false-discovery rate while maintaining segmentation quality.
  - Multigroup debiasing for recidivism risk scores – providing per-intersection fairness guarantees across overlapping race-sex groups.
- Open‑source implementation (released with the paper) that plugs into standard ML pipelines (PyTorch, scikit‑learn).
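As a concrete illustration of the selective-classification use case above, the snippet below sketches the rejection rule: predictions whose confidence falls below a calibrated threshold are deferred rather than served. The function name and the threshold value are illustrative, not the paper's released API.

```python
# Hypothetical sketch: reject (defer) predictions below a calibrated
# confidence threshold so that error on accepted examples stays bounded.

def selective_predict(confidence, label_pred, theta):
    """Return the prediction if confidence >= theta, else None (reject)."""
    return label_pred if confidence >= theta else None

# Toy usage: three model outputs as (confidence, predicted label) pairs.
outputs = [(0.95, "cat"), (0.60, "dog"), (0.88, "cat")]
theta = 0.8  # calibrated threshold (illustrative value, not from the paper)
accepted = [selective_predict(c, y, theta) for c, y in outputs]
# accepted -> ["cat", None, "cat"]
```

The rejected (`None`) entries are exactly the predictions a deployment would route to a fallback, which is what produces a nonzero rejection rate in exchange for a bounded error on what is served.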
Methodology
- Base algorithm – Any black‑box predictor (e.g., a neural net) that outputs a score or a set of candidate outputs.
- Loss definition – The user supplies a loss function \(L(z, \theta)\), where \(z\) is a data point and \(\theta \in \mathbb{R}^d\) is a vector of control parameters (e.g., a confidence threshold, a group-specific multiplier). The loss can be non-monotonic in each component of \(\theta\).
- Stability metric – For a training set \(S\) and a perturbed set \(S'\) (differing by one example), the algorithm's output changes by at most \(\Delta\). This \(\Delta\) is estimated empirically or bounded analytically.
- Conformal calibration set – A held-out calibration split is used to compute empirical losses for many candidate \(\theta\) values.
- Quantile selection – Using a distribution-free bound (based on the Dvoretzky–Kiefer–Wolfowitz inequality) and the stability term \(\Delta\), the method selects the smallest \(\theta\) that satisfies the user-specified risk level \(\alpha\) with high probability (e.g., 95%).
- Prediction time – The chosen \(\theta^*\) is fixed for all future test points; the underlying predictor runs unchanged, but its output is post-processed (e.g., reject if confidence < \(\theta^*_1\), adjust the segmentation mask if IoU < \(\theta^*_2\)).
The key insight is that the risk guarantee holds without any distributional assumptions and applies even when the loss surface is non-monotonic or irregular, as long as the underlying algorithm is not too unstable.
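The calibration-and-selection steps above can be sketched as follows. This is a simplified, hypothetical implementation: it substitutes a Hoeffding-style finite-sample slack for a loss bounded in [0, 1] in place of the paper's DKW-based bound, and scans candidate \(\theta\) values directly, which requires no monotonicity of the loss. All names and values are illustrative.

```python
import math

def calibrate_theta(loss_fn, calib_data, thetas, alpha, delta, stability=0.0):
    """Pick the smallest theta whose empirical calibration risk, inflated
    by a finite-sample slack and the stability term, stays below alpha.

    Simplified sketch: the slack is a Hoeffding-style term for a loss in
    [0, 1]; the paper's DKW-based bound plays an analogous role.
    """
    n = len(calib_data)
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    for theta in sorted(thetas):
        risk = sum(loss_fn(z, theta) for z in calib_data) / n
        if risk + slack + stability <= alpha:
            return theta
    return None  # no candidate theta meets the budget at this confidence

# Toy loss: a point incurs loss 1 if its score falls below theta.
calib = [0.2, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 0.97, 0.99]
loss = lambda z, theta: 1.0 if z < theta else 0.0
theta_star = calibrate_theta(loss, calib, [0.1, 0.3, 0.5],
                             alpha=0.6, delta=0.05)
# theta_star -> 0.1
```

Because the candidate set is scanned exhaustively rather than searched by bisection, nothing in this loop relies on the loss being monotonic in \(\theta\); that is the structural point the paper generalizes.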
Results & Findings
| Application | Target risk | Achieved risk (95% CI) | Remarks |
|---|---|---|---|
| Selective classification (CIFAR‑10) | ≤ 5% error on accepted examples | 4.8% (±0.2%) | Rejection rate ≈ 22% |
| Tumor segmentation (BraTS) – FDR | ≤ 10% false discoveries | 9.6% (±0.4%) | IoU ≥ 0.75 for 87% of accepted masks |
| Recidivism risk scores – multigroup demographic parity | ≤ 2% disparity across 6 race‑sex intersections | 1.7% (±0.3%) | No loss in overall AUC (0.78) |
Across all experiments, the stability‑aware bounds were noticeably tighter than naïve worst‑case bounds, confirming the theoretical claim that more stable learners (e.g., regularized logistic regression) enjoy sharper risk control.
Practical Implications
- Deployable safety nets: Teams can wrap any existing model with a conformal risk controller to guarantee that, say, the false‑positive rate for a medical diagnosis never exceeds a regulatory threshold, while still delivering predictions for the majority of cases.
- Selective inference pipelines: In production systems where latency matters, the controller can decide on‑the‑fly whether to serve a prediction or fall back to a human‑in‑the‑loop, based on a calibrated risk budget.
- Fairness‑by‑design: The multigroup extension lets product owners enforce intersectional fairness constraints (race × gender, age × location, etc.) without retraining the model for each subgroup.
- Model‑agnostic integration: Because the method only needs a calibration set and a stability estimate, it can be added to legacy models (e.g., gradient‑boosted trees) without touching the training code.
- Regulatory compliance: The distribution‑free guarantees align well with emerging AI governance frameworks that require quantifiable risk bounds rather than opaque statistical assumptions.
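To illustrate the model-agnostic integration point, a minimal, hypothetical wrapper might look like the following: any legacy scoring function is post-processed with a threshold fixed at calibration time, and low-score inputs are deferred to a fallback path. The class and names here are assumptions for illustration, not the paper's released implementation.

```python
# Hypothetical wrapper: post-process any black-box scorer with a
# calibrated threshold, without touching its training code.

class RiskControlledModel:
    def __init__(self, score_fn, theta):
        self.score_fn = score_fn   # black-box scorer (e.g., a legacy model)
        self.theta = theta         # threshold fixed at calibration time

    def predict(self, x):
        score = self.score_fn(x)
        if score < self.theta:
            return None            # defer to a human / fallback path
        return score

# Stand-in for a real model's scoring function.
legacy_scorer = lambda x: 0.9 if x > 0 else 0.3
controlled = RiskControlledModel(legacy_scorer, theta=0.5)
results = [controlled.predict(x) for x in (-1, 1)]
# results -> [None, 0.9]
```

The wrapper never inspects the scorer's internals, which is what makes the same pattern applicable to neural nets, gradient-boosted trees, or any other frozen model.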
Limitations & Future Work
- Stability estimation can be costly for large deep nets; the paper uses a simple empirical leave‑one‑out approach, which may underestimate true instability in highly non‑convex models.
- The guarantees are conservative when the calibration set is small; scaling to massive data streams will require online conformal updates.
- The current theory assumes a fixed calibration split; extending to cross‑validated or bootstrap‑based calibration could improve data efficiency.
- Future research directions mentioned by the authors include: (1) automatic tuning of the stability‑regularization trade‑off, (2) handling structured outputs beyond sets (e.g., graphs), and (3) tighter finite‑sample bounds using recent advances in martingale concentration.
Authors
- Anastasios N. Angelopoulos
Paper Information
- arXiv ID: 2602.20151v1
- Categories: stat.ME, cs.LG, math.ST, stat.ML
- Published: February 23, 2026