[Paper] Conformal Risk Control for Non-Monotonic Losses
Source: arXiv - 2602.20151v1
Overview
The paper extends conformal risk control, a framework for guaranteeing that a model's loss stays within a user-specified budget, beyond the classic setting where the loss is monotonic in a single threshold. By handling non-monotonic losses and multidimensional control parameters, the authors make risk-controlled prediction usable for a far broader set of real-world tasks, such as selective classification, medical image segmentation, and fairness-aware recidivism scoring.
Key Contributions
- Generalized risk‑control guarantees for arbitrary loss functions, even when the loss does not increase monotonically with the control parameter.
- Multidimensional control: the theory supports several knobs (e.g., coverage, false‑discovery rate, intersection‑over‑union) that can be tuned simultaneously.
- Stability‑aware bounds: the tightness of the guarantee scales with a measurable notion of algorithmic stability, highlighting why well‑behaved (stable) learners are preferable.
- Concrete applications demonstrating the method on:
  - Selective image classification – rejecting uncertain predictions to keep error below a target.
  - FDR and IoU control for tumor segmentation – ensuring a bounded false-discovery rate while maintaining segmentation quality.
  - Multigroup debiasing for recidivism risk scores – providing per-intersection fairness guarantees across overlapping race-sex groups.
- Open‑source implementation (released with the paper) that plugs into standard ML pipelines (PyTorch, scikit‑learn).
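As a concrete illustration of the selective-classification use case above, the snippet below sketches the rejection rule: predictions whose confidence falls below a calibrated threshold are deferred rather than served. The function name and the threshold value are illustrative, not the paper's released API.

```python
# Hypothetical sketch: reject (defer) predictions below a calibrated
# confidence threshold so that error on accepted examples stays bounded.

def selective_predict(confidence, label_pred, theta):
    """Return the prediction if confidence >= theta, else None (reject)."""
    return label_pred if confidence >= theta else None

# Toy usage: three model outputs as (confidence, predicted label) pairs.
outputs = [(0.95, "cat"), (0.60, "dog"), (0.88, "cat")]
theta = 0.8  # calibrated threshold (illustrative value, not from the paper)
accepted = [selective_predict(c, y, theta) for c, y in outputs]
# accepted -> ["cat", None, "cat"]
```

The rejected (`None`) entries are exactly the predictions a deployment would route to a fallback, which is what produces a nonzero rejection rate in exchange for a bounded error on what is served.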
Methodology
- Base algorithm – Any black‑box predictor (e.g., a neural net) that outputs a score or a set of candidate outputs.
- Loss definition – The user supplies a loss function \(L(z, \theta)\), where \(z\) is a data point and \(\theta \in \mathbb{R}^d\) is a vector of control parameters (e.g., a confidence threshold, a group-specific multiplier). The loss can be non-monotonic in each component of \(\theta\).
- Stability metric – For a training set \(S\) and a perturbed set \(S'\) (differing by one example), the algorithm's output changes by at most \(\Delta\). This \(\Delta\) is estimated empirically or bounded analytically.
- Conformal calibration set – A held-out calibration split is used to compute empirical losses for many candidate \(\theta\) values.
- Quantile selection – Using a distribution-free bound (based on the Dvoretzky–Kiefer–Wolfowitz inequality) and the stability term \(\Delta\), the method selects the smallest \(\theta\) that satisfies the user-specified risk level \(\alpha\) with high probability (e.g., 95%).
- Prediction time – The chosen \(\theta^*\) is fixed for all future test points; the underlying predictor runs unchanged, but its output is post-processed (e.g., reject if confidence < \(\theta^*_1\), adjust the segmentation mask if IoU < \(\theta^*_2\)).
The key insight is that the risk guarantee holds without any distributional assumptions and applies even when the loss surface is non-monotonic or irregular, as long as the underlying algorithm is not too unstable.
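The calibration-and-selection steps above can be sketched as follows. This is a simplified, hypothetical implementation: it substitutes a Hoeffding-style finite-sample slack for a loss bounded in [0, 1] in place of the paper's DKW-based bound, and scans candidate \(\theta\) values directly, which requires no monotonicity of the loss. All names and values are illustrative.

```python
import math

def calibrate_theta(loss_fn, calib_data, thetas, alpha, delta, stability=0.0):
    """Pick the smallest theta whose empirical calibration risk, inflated
    by a finite-sample slack and the stability term, stays below alpha.

    Simplified sketch: the slack is a Hoeffding-style term for a loss in
    [0, 1]; the paper's DKW-based bound plays an analogous role.
    """
    n = len(calib_data)
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    for theta in sorted(thetas):
        risk = sum(loss_fn(z, theta) for z in calib_data) / n
        if risk + slack + stability <= alpha:
            return theta
    return None  # no candidate theta meets the budget at this confidence

# Toy loss: a point incurs loss 1 if its score falls below theta.
calib = [0.2, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 0.97, 0.99]
loss = lambda z, theta: 1.0 if z < theta else 0.0
theta_star = calibrate_theta(loss, calib, [0.1, 0.3, 0.5],
                             alpha=0.6, delta=0.05)
# theta_star -> 0.1
```

Because the candidate set is scanned exhaustively rather than searched by bisection, nothing in this loop relies on the loss being monotonic in \(\theta\); that is the structural point the paper generalizes.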
Results & Findings
| Application | Target risk | Achieved risk (95% CI) | Remarks |
|---|---|---|---|
| Selective classification (CIFAR‑10) | ≤ 5% error on accepted examples | 4.8% (±0.2%) | Rejection rate ≈ 22% |
| Tumor segmentation (BraTS) – FDR | ≤ 10% false discoveries | 9.6% (±0.4%) | IoU ≥ 0.75 for 87% of accepted masks |
| Recidivism risk scores – multigroup demographic parity | ≤ 2% disparity across 6 race‑sex intersections | 1.7% (±0.3%) | No loss in overall AUC (0.78) |
Across all experiments, the stability‑aware bounds were noticeably tighter than naïve worst‑case bounds, confirming the theoretical claim that more stable learners (e.g., regularized logistic regression) enjoy sharper risk control.
Practical Implications
- Deployable safety nets: Teams can wrap any existing model with a conformal risk controller to guarantee that, say, the false‑positive rate for a medical diagnosis never exceeds a regulatory threshold, while still delivering predictions for the majority of cases.
- Selective inference pipelines: In production systems where latency matters, the controller can decide on‑the‑fly whether to serve a prediction or fall back to a human‑in‑the‑loop, based on a calibrated risk budget.
- Fairness‑by‑design: The multigroup extension lets product owners enforce intersectional fairness constraints (race × gender, age × location, etc.) without retraining the model for each subgroup.
- Model‑agnostic integration: Because the method only needs a calibration set and a stability estimate, it can be added to legacy models (e.g., gradient‑boosted trees) without touching the training code.
- Regulatory compliance: The distribution‑free guarantees align well with emerging AI governance frameworks that require quantifiable risk bounds rather than opaque statistical assumptions.
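To illustrate the model-agnostic integration point, a minimal, hypothetical wrapper might look like the following: any legacy scoring function is post-processed with a threshold fixed at calibration time, and low-score inputs are deferred to a fallback path. The class and names here are assumptions for illustration, not the paper's released implementation.

```python
# Hypothetical wrapper: post-process any black-box scorer with a
# calibrated threshold, without touching its training code.

class RiskControlledModel:
    def __init__(self, score_fn, theta):
        self.score_fn = score_fn   # black-box scorer (e.g., a legacy model)
        self.theta = theta         # threshold fixed at calibration time

    def predict(self, x):
        score = self.score_fn(x)
        if score < self.theta:
            return None            # defer to a human / fallback path
        return score

# Stand-in for a real model's scoring function.
legacy_scorer = lambda x: 0.9 if x > 0 else 0.3
controlled = RiskControlledModel(legacy_scorer, theta=0.5)
results = [controlled.predict(x) for x in (-1, 1)]
# results -> [None, 0.9]
```

The wrapper never inspects the scorer's internals, which is what makes the same pattern applicable to neural nets, gradient-boosted trees, or any other frozen model.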
Limitations & Future Work
- Stability estimation can be costly for large deep nets; the paper uses a simple empirical leave‑one‑out approach, which may underestimate true instability in highly non‑convex models.
- The guarantees are conservative when the calibration set is small; scaling to massive data streams will require online conformal updates.
- The current theory assumes a fixed calibration split; extending to cross‑validated or bootstrap‑based calibration could improve data efficiency.
- Future research directions mentioned by the authors include: (1) automatic tuning of the stability‑regularization trade‑off, (2) handling structured outputs beyond sets (e.g., graphs), and (3) tighter finite‑sample bounds using recent advances in martingale concentration.
Authors
- Anastasios N. Angelopoulos
Paper Information
- arXiv ID: 2602.20151v1
- Categories: stat.ME, cs.LG, math.ST, stat.ML
- Published: February 23, 2026