[Paper] The Sample Complexity of Multicalibration
Source: arXiv - 2604.21923v1
Overview
This paper investigates how many data points a learning algorithm needs to achieve multicalibration—a strong fairness/accuracy guarantee that a predictor’s confidence scores are well‑aligned not just overall, but across many sub‑populations (or “groups”). The authors pinpoint the exact sample complexity (up to polylog factors) for the batch setting, revealing a sharp transition: when the number of groups grows modestly with the target error ε, you need roughly ε⁻³ samples, whereas with a constant number of groups the requirement drops to ε⁻².
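To make the guarantee concrete, here is one common way to instantiate a multicalibration metric as code: the worst-case binned calibration gap over a family of groups. The fixed-width binning and L1 weighting are choices made for this sketch, not details taken from the paper.

```python
import numpy as np

def multicalibration_error(preds, labels, groups, n_bins=10):
    """Worst-case binned calibration gap over a family of groups.

    preds:  predicted probabilities in [0, 1], shape (n,)
    labels: binary outcomes, shape (n,)
    groups: list of boolean masks, each shape (n,), one per sub-population
    """
    # Assign each prediction to a fixed-width confidence bin.
    bins = np.minimum((preds * n_bins).astype(int), n_bins - 1)
    worst = 0.0
    for mask in groups:
        n_g = mask.sum()
        if n_g == 0:
            continue
        err = 0.0
        for b in range(n_bins):
            in_bin = mask & (bins == b)
            if in_bin.sum() == 0:
                continue
            # Gap between mean confidence and mean outcome in this bin,
            # weighted by the bin's share of the group (an L1-style error).
            gap = abs(preds[in_bin].mean() - labels[in_bin].mean())
            err += (in_bin.sum() / n_g) * gap
        worst = max(worst, err)
    return worst
```

A predictor is multicalibrated (at level ε) when this worst-case gap is at most ε; marginal calibration corresponds to the single group covering the whole population.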
Key Contributions
- Tight sample‑complexity bounds for multicalibration in the batch (i.i.d.) setting:
- For group families of size |G| ≤ ε⁻ᵏ (for any fixed k > 0), the optimal number of samples is ~Θ(ε⁻³).
- When the group family size is constant (k = 0), the bound improves to ~Θ(ε⁻²), matching marginal calibration.
- Separation from marginal calibration: shows multicalibration can be strictly harder (ε⁻³ vs. ε⁻²) even though both aim to align predicted probabilities with outcomes.
- Online‑to‑batch reduction: constructs a randomized predictor that attains the upper bound by converting an online multicalibration algorithm into a batch one.
- Generalized Lₚ multicalibration: extends the analysis to weighted Lₚ metrics (1 ≤ p ≤ 2) and proves optimal exponent 3/p for the sample complexity.
- Broader lower‑bound framework: adapts the technique to any elicitable property (e.g., expectiles, bounded‑density quantiles), yielding matching bounds when combined with recent online results.
Methodology
- Problem Formalization – The learner receives n i.i.d. examples (X, Y) from an unknown distribution and must output a predictor p̂ (possibly randomized). Multicalibration error is measured by the Expected Calibration Error (ECE) across a predefined collection of groups G.
- Minimax Sample‑Complexity Analysis – The task is treated as a game between the learner and an adversarial distribution, deriving lower bounds via information‑theoretic arguments that hold even for randomized predictors.
- Upper Bound via Online‑to‑Batch – Starting from an existing online multicalibration algorithm (which guarantees low regret against any sequence of examples) and applying a standard online‑to‑batch conversion (e.g., averaging the iterates) yields a batch predictor whose sample complexity matches the lower bound up to logarithmic factors.
- Extension to Lₚ Metrics – By redefining the calibration error with an Lₚ norm and re‑working the concentration arguments, a family of bounds parameterized by p is obtained.
- Elicitable‑Property Generalization – Using the “regular class” definition of elicitable statistics, the lower‑bound construction is transplanted to other prediction tasks (expectiles, quantiles) and combined with recent online algorithms to achieve matching upper bounds.
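The online‑to‑batch step above can be sketched as follows. `RunningMeanLearner` is a toy stand‑in for the paper's online multicalibration algorithm (not its actual update rule); the conversion itself, collecting the iterates and sampling one uniformly at prediction time, mirrors the standard reduction the authors invoke to obtain a randomized batch predictor.

```python
import random

class RunningMeanLearner:
    """Illustrative online learner: its current predictor outputs the
    running mean of the labels seen so far (purely a placeholder)."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def predictor(self):
        mean = self.total / self.count if self.count else 0.5
        return lambda x, m=mean: m  # freeze the current iterate

    def update(self, x, y):
        self.total += y
        self.count += 1

def online_to_batch(learner, samples, rng=random):
    """Convert an online learner into a randomized batch predictor by
    snapshotting its iterates and sampling one uniformly per query."""
    iterates = []
    for x, y in samples:
        iterates.append(learner.predictor())  # snapshot before the update
        learner.update(x, y)
    return lambda x: rng.choice(iterates)(x)

# Feed four positive examples, then query the randomized batch predictor.
batch_pred = online_to_batch(RunningMeanLearner(), [(0, 1)] * 4)
```

The randomization matters: the batch guarantee is attained by the uniform mixture over iterates, not by any single deterministic iterate.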
Results & Findings
| Setting | Group family size | Sample complexity (up to polylog) |
|---|---|---|
| Constant groups | \|G\| = O(1) | ~Θ(ε⁻²) |
| Growing groups | \|G\| ≤ ε⁻ᵏ, any fixed k > 0 | ~Θ(ε⁻³) |
| General Lₚ | 1 ≤ p ≤ 2 | ~Θ(ε^{-3/p}) |
| Elicitable properties (e.g., expectiles) | – | Same ε‑dependent rates as multicalibration |
Interpretation:
- When calibration is required across many overlapping sub‑populations, the data requirement jumps from quadratic to cubic in 1/ε.
- The cubic dependence is tight: no algorithm can beat it, even with randomization.
- The threshold at k = 0 is sharp: letting the group family grow even as a small polynomial in 1/ε pushes you into the harder ε⁻³ regime.
- The same hardness carries over to other statistics that can be “elicited” via proper scoring rules.
Practical Implications
- Fairness‑aware model deployment – Teams that enforce multicalibration (e.g., for credit scoring across demographic slices) should budget on the order of 1/ε times more data (ε⁻³ rather than ε⁻²) than for ordinary calibration whenever the number of slices grows with the desired precision.
- Model‑selection trade‑offs – When data is scarce, it may be wiser to limit the group family (e.g., focus on a handful of high‑risk cohorts) to stay in the ε⁻² regime.
- Algorithm design – The online‑to‑batch reduction suggests that existing streaming multicalibration tools can be repurposed for offline training pipelines without redesigning the core algorithm.
- Beyond probabilities – The extension to expectiles and quantiles means that risk‑sensitive metrics (Value‑at‑Risk, conditional expectiles) can be calibrated with the same sample‑size guarantees, opening doors for regulated industries (finance, insurance).
- Tooling – Practitioners can now estimate the minimum dataset size needed for a target multicalibration error, aiding data‑collection planning and cost‑benefit analyses.
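As a rough planning aid, the rates above translate into a back‑of‑the‑envelope budget function. The leading constant `c` and the decision to drop polylog factors are assumptions made here; the paper pins down only the exponent in 1/ε.

```python
import math

def sample_budget(eps, growing_groups, c=1.0):
    """Order-of-magnitude sample budget for target multicalibration error eps.

    growing_groups: True if the group family grows with 1/eps (the eps^-3
    regime), False for a constant-size family (the eps^-2 regime).
    The constant c and omitted polylog factors are placeholders.
    """
    exponent = 3 if growing_groups else 2
    return math.ceil(c * eps ** (-exponent))
```

For example, halving the target error ε multiplies the budget by 8 in the growing-groups regime but only by 4 with constantly many groups, which is exactly the gap the paper proves unavoidable.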
Limitations & Future Work
- Polylogarithmic gaps – The bounds hide logarithmic factors; tightening them could matter for very high‑precision regimes.
- Assumption on group structure – The analysis treats the group family as given and arbitrary; real‑world groups often have hierarchical or overlapping structures that might admit more efficient algorithms.
- Batch‑only focus – While the paper connects to online results, it does not explore hybrid settings (e.g., mini‑batch updates) that are common in modern training pipelines.
- Empirical validation – The work is theoretical; implementing the online‑to‑batch predictor and benchmarking against existing multicalibration libraries would solidify practical relevance.
- Extension to deep models – Understanding how these sample‑complexity limits interact with high‑capacity models (neural nets) and regularization techniques remains an open question.
Bottom line: If you’re building systems that must be calibrated across many sub‑populations or risk metrics, expect to need roughly ε⁻³ samples to hit a tight calibration error ε. This paper gives you the theoretical yardstick to plan data collection, choose group granularity, and select algorithms that are provably optimal in the worst case.
Authors
- Natalie Collina
- Jiuyao Lu
- Georgy Noarov
- Aaron Roth
Paper Information
- arXiv ID: 2604.21923v1
- Categories: cs.LG, math.ST, stat.ML
- Published: April 23, 2026