[Paper] Calibrated Multi-Level Quantile Forecasting

Published: December 29, 2025 at 01:25 PM EST

Source: arXiv - 2512.23671v1

Overview

The paper introduces Multi-Level Quantile Tracker (MultiQT), an online wrapper that can be placed around any existing point or quantile predictor to guarantee calibration across several quantile levels at once. In simple terms, MultiQT ensures that a 60% quantile forecast is at or above the true outcome about 60% of the time, even when the data distribution changes abruptly. This is achieved with only marginal impact on predictive accuracy and with provable no-regret guarantees.

Key Contributions

Unified multi‑quantile calibration: A single algorithm that simultaneously enforces calibration for an arbitrary set of quantile levels (e.g., 0.1, 0.5, 0.9).
Model‑agnostic wrapper: MultiQT can be placed around any off‑the‑shelf forecaster (ARIMA, LSTM, Prophet, etc.) and automatically corrects its outputs.
Adversarial robustness: Guarantees hold even under worst-case, non-stationary distribution shifts, which is useful for real-time systems where data drift is common.
Monotonicity preservation: The corrected forecasts remain ordered (lower quantiles never exceed higher ones), a property often broken by naive post‑processing.
No‑regret quantile‑loss bound: As the horizon grows, MultiQT’s quantile loss converges to that of the underlying forecaster, meaning it won’t degrade performance asymptotically.
Empirical validation: Demonstrated substantial calibration gains on epidemic (COVID‑19 case counts) and energy demand forecasting tasks, with only marginal impact on raw predictive error.

Methodology

Online calibration game

The authors frame quantile forecasting as a repeated game where, at each time step t, the forecaster outputs a set of quantile predictions (\hat{q}_t^{(\alpha)}) for chosen levels (\alpha). After observing the true outcome (y_t), the algorithm checks whether each prediction covered it, i.e., whether (y_t \le \hat{q}_t^{(\alpha)}). A level is calibrated when this holds for roughly an (\alpha) fraction of steps.
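The calibration condition can be checked empirically by counting how often each forecast covered the outcome. The snippet below is a minimal sketch of that check; the function name `coverage_errors` and the list-of-rows input layout are assumptions made for illustration and do not come from the paper.

```python
def coverage_errors(quantile_forecasts, outcomes, levels):
    """Return |empirical coverage - target level| for each quantile level.

    quantile_forecasts: list of T rows, each holding one forecast per level.
    outcomes: list of T observed values.
    levels: list of K target levels (alpha), in the same order as the rows.
    """
    n_steps = len(outcomes)
    errors = []
    for k, alpha in enumerate(levels):
        # A forecast covers the outcome when the outcome is at or below it.
        covered = sum(
            1 for t in range(n_steps) if outcomes[t] <= quantile_forecasts[t][k]
        )
        errors.append(abs(covered / n_steps - alpha))
    return errors


# Toy usage: constant forecasts of 5 (level 0.5) and 7 (level 0.9)
# against outcomes 1..10.
toy_outcomes = list(range(1, 11))
toy_forecasts = [[5.0, 7.0] for _ in toy_outcomes]
toy_errors = coverage_errors(toy_forecasts, toy_outcomes, [0.5, 0.9])
```

In the toy data the 0.5-level forecast covers exactly half of the outcomes, while the 0.9-level forecast covers only 7 of 10, so its error is about 0.2.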

MultiQT wrapper

Error counters: For each quantile level, MultiQT maintains a running count of calibration “mistakes” (how often the prediction was too low or too high).
Adjustment rule: When a level drifts away from its target fraction, MultiQT nudges the forecast up or down by a small amount proportional to the cumulative error.
Isotonic projection: After adjusting all levels, the algorithm applies a lightweight isotonic regression to enforce monotonicity (ensuring (\hat{q}^{(\alpha)} \le \hat{q}^{(\beta)}) whenever (\alpha < \beta)).
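The three steps above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the authors' implementation: the class name `MultiLevelTrackerSketch`, the fixed `step_size`, the choice to update from the issued (projected) forecasts, and the use of pool-adjacent-violators for the isotonic step are all assumptions, and the paper's exact update rule may differ.

```python
def isotonic_projection(values):
    """Least-squares projection onto non-decreasing sequences
    (pool-adjacent-violators, linear time in the number of levels)."""
    blocks = []  # each block is [sum of values, count]
    for v in values:
        blocks.append([v, 1])
        # Merge while the previous block's mean exceeds the latest block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out


class MultiLevelTrackerSketch:
    """Hypothetical wrapper: one additive offset per quantile level."""

    def __init__(self, levels, step_size=0.5):
        self.levels = list(levels)      # assumed sorted in ascending order
        self.step_size = step_size      # assumed fixed step size
        self.offsets = [0.0] * len(self.levels)

    def predict(self, base_quantiles):
        # Shift each base forecast by its offset, then restore ordering.
        shifted = [q + o for q, o in zip(base_quantiles, self.offsets)]
        return isotonic_projection(shifted)

    def update(self, issued_quantiles, y):
        for k, alpha in enumerate(self.levels):
            covered = 1.0 if y <= issued_quantiles[k] else 0.0
            # A miss (y above the forecast) raises the offset; a cover lowers it.
            # Summed over time, the offset is step_size times the cumulative
            # coverage error for that level.
            self.offsets[k] += self.step_size * (alpha - covered)


def run_stream(tracker, base_forecasts, outcomes):
    """Feed a stream through the tracker and return the issued forecasts."""
    issued = []
    for base, y in zip(base_forecasts, outcomes):
        q = tracker.predict(base)
        issued.append(q)
        tracker.update(q, y)
    return issued
```

A base forecaster would supply `base_forecasts` one row at a time; the wrapper only needs the observed outcome after each step, so the base model is never retrained.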

Theoretical guarantees

Using tools from online learning (e.g., regret analysis) and martingale concentration, the authors prove that the calibration error converges to zero and that the additional quantile loss incurred by the adjustments vanishes in the long run (no‑regret).
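In symbols, the two guarantees can be written roughly as follows. The notation here is an assumption chosen for illustration: (b_t^{(\alpha)}) denotes the base forecaster's prediction, (\rho_\alpha) the quantile (pinball) loss, and (\mathcal{A}) the set of quantile levels. The exact comparator, constants, and rates are given in the paper.

```latex
% Calibration: long-run coverage at each level approaches its target.
\left| \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\{ y_t \le \hat{q}_t^{(\alpha)} \} - \alpha \right|
\;\longrightarrow\; 0
\quad \text{as } T \to \infty, \qquad \text{for every } \alpha \in \mathcal{A}.

% No regret: the average extra quantile loss relative to the base forecaster vanishes.
\frac{1}{T} \sum_{t=1}^{T} \sum_{\alpha \in \mathcal{A}}
\Big[ \rho_\alpha\big(y_t, \hat{q}_t^{(\alpha)}\big) - \rho_\alpha\big(y_t, b_t^{(\alpha)}\big) \Big]
\;\le\; o(1),
\qquad
\rho_\alpha(y, q) = \big( \mathbf{1}\{ y \le q \} - \alpha \big)(q - y).
```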

Implementation details

MultiQT runs in (O(K)) time per step for (K) quantile levels and requires only constant memory per level, making it suitable for high‑frequency streaming applications.

Results & Findings

Dataset	Baseline forecaster	Calibration error (pre‑MultiQT)	Calibration error (post‑MultiQT)	Point‑forecast error change
COVID‑19 weekly cases (US)	Prophet + quantile regression	0.18 (10 % level) – 0.32 (90 % level)	0.04 – 0.07	+0.3 % MAE
Hourly electricity demand (CAISO)	Gradient‑boosted trees	0.12 – 0.27	0.02 – 0.05	+0.1 % RMSE

Calibration improvement: Across all quantile levels, the deviation from the target coverage dropped by a factor of 4–6.
Negligible accuracy loss: The increase in standard point-forecast error metrics (MAE, RMSE) was under 0.5 %, which is consistent with the no-regret guarantee, although that guarantee is stated for quantile loss rather than for MAE or RMSE.
Robustness to drift: In simulated regime‑change experiments (e.g., sudden spikes in demand), MultiQT re‑calibrated within a handful of steps, whereas the raw forecaster remained mis‑calibrated for the entire horizon.
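To check the accuracy side of these findings on one's own data, a natural comparison is the average quantile (pinball) loss of the forecasts before and after wrapping. The helper below is a minimal sketch of that metric; the function names and the list-of-rows layout are assumptions for illustration.

```python
def pinball_loss(y, q, alpha):
    """Quantile (pinball) loss of forecast q at level alpha for outcome y."""
    if y >= q:
        # Under-prediction is penalized with weight alpha.
        return alpha * (y - q)
    # Over-prediction is penalized with weight (1 - alpha).
    return (1.0 - alpha) * (q - y)


def mean_pinball_loss(outcomes, quantile_forecasts, levels):
    """Average pinball loss over all time steps and all quantile levels."""
    total = 0.0
    for y, row in zip(outcomes, quantile_forecasts):
        total += sum(pinball_loss(y, q, a) for q, a in zip(row, levels))
    return total / (len(outcomes) * len(levels))
```

Calling `mean_pinball_loss` once on the raw forecasts and once on the corrected forecasts, with the same outcomes and levels, gives the two numbers whose difference the no-regret bound is about.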

Practical Implications

Risk-aware decision making: Many production systems (inventory planning, load balancing, financial risk) rely on quantile forecasts to set safety buffers. MultiQT guarantees that the quantiles behind those buffers reach their target long-run coverage, which helps reduce over- or under-provisioning.
Plug‑and‑play for existing pipelines: Because MultiQT is a thin wrapper, teams can retrofit it onto legacy models without retraining, saving engineering effort.
Streaming & edge deployments: The algorithm's (O(K)) per-step updates and constant memory per level make it viable for real-time inference on IoT devices or low-latency services.
Regulatory compliance: In domains like healthcare or energy, calibrated predictive intervals are often required for auditability; MultiQT provides a mathematically backed way to meet such standards.

Limitations & Future Work

Dependence on initial forecaster quality: MultiQT corrects calibration only; if the underlying model's predictions are severely biased, the corrected quantiles can reach their target coverage and still be too wide or too poorly centered to be useful.
Fixed quantile set: The current formulation assumes a pre‑specified list of quantile levels. Dynamically adding or removing levels would require re‑initializing the counters.
Theoretical focus on adversarial settings: While robustness is a strength, the worst‑case analysis may be overly conservative for many practical, mildly non‑stationary streams.
Future directions: Extending MultiQT to handle multivariate quantiles (e.g., joint demand‑price forecasts), integrating adaptive learning rates for faster drift recovery, and exploring hybrid approaches that jointly train the base forecaster and the calibration wrapper.

Authors

Tiffany Ding
Isaac Gibbs
Ryan J. Tibshirani

Paper Information

arXiv ID: 2512.23671v1
Categories: stat.ML, cs.LG, math.OC, stat.ME
Published: December 29, 2025