[Paper] The Sample Complexity of Multicalibration
Source: arXiv - 2604.21923v1
Overview
This paper investigates how many data points a learning algorithm needs to achieve multicalibration—a strong fairness/accuracy guarantee that a predictor’s confidence scores are well‑aligned not just overall, but across many sub‑populations (or “groups”). The authors pinpoint the exact sample complexity (up to polylog factors) for the batch setting, revealing a sharp transition: when the number of groups grows modestly with the target error ε, you need roughly ε⁻³ samples, whereas with a constant number of groups the requirement drops to ε⁻².
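To make the guarantee concrete, here is one common way to instantiate a multicalibration metric as code: the worst-case binned calibration gap over a family of groups. The fixed-width binning and L1 weighting are choices made for this sketch, not details taken from the paper.

```python
import numpy as np

def multicalibration_error(preds, labels, groups, n_bins=10):
    """Worst-case binned calibration gap over a family of groups.

    preds:  predicted probabilities in [0, 1], shape (n,)
    labels: binary outcomes, shape (n,)
    groups: list of boolean masks, each shape (n,), one per sub-population
    """
    # Assign each prediction to a fixed-width confidence bin.
    bins = np.minimum((preds * n_bins).astype(int), n_bins - 1)
    worst = 0.0
    for mask in groups:
        n_g = mask.sum()
        if n_g == 0:
            continue
        err = 0.0
        for b in range(n_bins):
            in_bin = mask & (bins == b)
            if in_bin.sum() == 0:
                continue
            # Gap between mean confidence and mean outcome in this bin,
            # weighted by the bin's share of the group (an L1-style error).
            gap = abs(preds[in_bin].mean() - labels[in_bin].mean())
            err += (in_bin.sum() / n_g) * gap
        worst = max(worst, err)
    return worst
```

A predictor is multicalibrated (at level ε) when this worst-case gap is at most ε; marginal calibration corresponds to the single group covering the whole population.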
Key Contributions
- Tight sample‑complexity bounds for multicalibration in the batch (i.i.d.) setting:
- For group families of size |G| ≤ ε⁻ᵏ (for any fixed k > 0), the optimal number of samples is ~Θ(ε⁻³).
- When the group family size is constant (k = 0), the bound improves to ~Θ(ε⁻²), matching marginal calibration.
- Separation from marginal calibration: shows multicalibration can be strictly harder (ε⁻³ vs. ε⁻²) even though both aim to align predicted probabilities with outcomes.
- Online‑to‑batch reduction: constructs a randomized predictor that attains the upper bound by converting an online multicalibration algorithm into a batch one.
- Generalized Lₚ multicalibration: extends the analysis to weighted Lₚ metrics (1 ≤ p ≤ 2) and proves optimal exponent 3/p for the sample complexity.
- Broader lower‑bound framework: adapts the technique to any elicitable property (e.g., expectiles, bounded‑density quantiles), yielding matching bounds when combined with recent online results.
Methodology
- Problem Formalization – The learner receives n i.i.d. examples (X, Y) from an unknown distribution and must output a predictor p̂ (possibly randomized). Multicalibration error is measured by the Expected Calibration Error (ECE) across a predefined collection of groups G.
- Minimax Sample‑Complexity Analysis – The task is treated as a game between the learner and an adversarial distribution, deriving lower bounds via information‑theoretic arguments that hold even for randomized predictors.
- Upper Bound via Online‑to‑Batch – Starting from an existing online multicalibration algorithm (which guarantees low regret against any sequence of examples) and applying a standard online‑to‑batch conversion (e.g., averaging the iterates) yields a batch predictor whose sample complexity matches the lower bound up to logarithmic factors.
- Extension to Lₚ Metrics – By redefining the calibration error with an Lₚ norm and re‑working the concentration arguments, a family of bounds parameterized by p is obtained.
- Elicitable‑Property Generalization – Using the “regular class” definition of elicitable statistics, the lower‑bound construction is transplanted to other prediction tasks (expectiles, quantiles) and combined with recent online algorithms to achieve matching upper bounds.
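The online‑to‑batch step above can be sketched as follows. `RunningMeanLearner` is a toy stand‑in for the paper's online multicalibration algorithm (not its actual update rule); the conversion itself, collecting the iterates and sampling one uniformly at prediction time, mirrors the standard reduction the authors invoke to obtain a randomized batch predictor.

```python
import random

class RunningMeanLearner:
    """Illustrative online learner: its current predictor outputs the
    running mean of the labels seen so far (purely a placeholder)."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def predictor(self):
        mean = self.total / self.count if self.count else 0.5
        return lambda x, m=mean: m  # freeze the current iterate

    def update(self, x, y):
        self.total += y
        self.count += 1

def online_to_batch(learner, samples, rng=random):
    """Convert an online learner into a randomized batch predictor by
    snapshotting its iterates and sampling one uniformly per query."""
    iterates = []
    for x, y in samples:
        iterates.append(learner.predictor())  # snapshot before the update
        learner.update(x, y)
    return lambda x: rng.choice(iterates)(x)

# Feed four positive examples, then query the randomized batch predictor.
batch_pred = online_to_batch(RunningMeanLearner(), [(0, 1)] * 4)
```

The randomization matters: the batch guarantee is attained by the uniform mixture over iterates, not by any single deterministic iterate.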
Results & Findings
| Setting | Group family size | Sample complexity (up to polylog) |
|---|---|---|
| Constant groups | \|G\| = O(1) | ~Θ(ε⁻²) |
| Growing groups | \|G\| ≤ ε⁻ᵏ, any fixed k > 0 | ~Θ(ε⁻³) |
| General Lₚ | 1 ≤ p ≤ 2 | ~Θ(ε^{-3/p}) |
| Elicitable properties (e.g., expectiles) | – | Same ε‑dependent rates as multicalibration |
Interpretation:
- When calibration is required across many overlapping sub‑populations, the data requirement jumps from quadratic to cubic in 1/ε.
- The cubic dependence is tight: no algorithm can beat it, even with randomization.
- The threshold at k = 0 is sharp: letting the group family grow even as a small polynomial in 1/ε pushes you into the harder ε⁻³ regime.
- The same hardness carries over to other statistics that can be “elicited” via proper scoring rules.
Practical Implications
- Fairness‑aware model deployment – Teams that enforce multicalibration (e.g., for credit scoring across demographic slices) should budget on the order of 1/ε times more data (ε⁻³ rather than ε⁻²) than for ordinary calibration whenever the number of slices grows with the desired precision.
- Model‑selection trade‑offs – When data is scarce, it may be wiser to limit the group family (e.g., focus on a handful of high‑risk cohorts) to stay in the ε⁻² regime.
- Algorithm design – The online‑to‑batch reduction suggests that existing streaming multicalibration tools can be repurposed for offline training pipelines without redesigning the core algorithm.
- Beyond probabilities – The extension to expectiles and quantiles means that risk‑sensitive metrics (Value‑at‑Risk, conditional expectiles) can be calibrated with the same sample‑size guarantees, opening doors for regulated industries (finance, insurance).
- Tooling – Practitioners can now estimate the minimum dataset size needed for a target multicalibration error, aiding data‑collection planning and cost‑benefit analyses.
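As a rough planning aid, the rates above translate into a back‑of‑the‑envelope budget function. The leading constant `c` and the decision to drop polylog factors are assumptions made here; the paper pins down only the exponent in 1/ε.

```python
import math

def sample_budget(eps, growing_groups, c=1.0):
    """Order-of-magnitude sample budget for target multicalibration error eps.

    growing_groups: True if the group family grows with 1/eps (the eps^-3
    regime), False for a constant-size family (the eps^-2 regime).
    The constant c and omitted polylog factors are placeholders.
    """
    exponent = 3 if growing_groups else 2
    return math.ceil(c * eps ** (-exponent))
```

For example, halving the target error ε multiplies the budget by 8 in the growing-groups regime but only by 4 with constantly many groups, which is exactly the gap the paper proves unavoidable.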
Limitations & Future Work
- Polylogarithmic gaps – The bounds hide logarithmic factors; tightening them could matter for very high‑precision regimes.
- Assumption on group structure – The analysis treats the group family as given and arbitrary; real‑world groups often have hierarchical or overlapping structures that might admit more efficient algorithms.
- Batch‑only focus – While the paper connects to online results, it does not explore hybrid settings (e.g., mini‑batch updates) that are common in modern training pipelines.
- Empirical validation – The work is theoretical; implementing the online‑to‑batch predictor and benchmarking against existing multicalibration libraries would solidify practical relevance.
- Extension to deep models – Understanding how these sample‑complexity limits interact with high‑capacity models (neural nets) and regularization techniques remains an open question.
Bottom line: If you’re building systems that must be calibrated across many sub‑populations or risk metrics, expect to need roughly ε⁻³ samples to hit a tight calibration error ε. This paper gives you the theoretical yardstick to plan data collection, choose group granularity, and select algorithms that are provably optimal in the worst case.
Authors
- Natalie Collina
- Jiuyao Lu
- Georgy Noarov
- Aaron Roth
Paper Information
- arXiv ID: 2604.21923v1
- Categories: cs.LG, math.ST, stat.ML
- Published: April 23, 2026