[Paper] Conformal Bandits: Bringing statistical validity and reward efficiency to the small-gap regime
Source: arXiv - 2512.09850v1
Overview
The paper Conformal Bandits proposes a new way to marry two powerful ideas—conformal prediction (a technique that gives finite‑sample statistical guarantees) and multi‑armed bandits (the workhorse of online decision‑making). By doing so, the authors obtain algorithms that not only keep regret low in the notoriously hard small‑gap regime (where the best and sub‑optimal actions differ only by a tiny amount) but also provide explicit coverage guarantees on the predicted rewards. Their experiments, including a realistic portfolio‑allocation task, show that this hybrid approach can beat classic bandit methods such as UCB and Thompson Sampling when the reward gaps are tiny.
Key Contributions
- Conformal Bandit framework: Introduces a principled way to embed conformal prediction intervals into any bandit policy, turning regret‑focused algorithms into statistically valid predictors.
- Finite‑time coverage guarantees: Proves that the constructed prediction sets achieve the desired coverage probability without asymptotic assumptions.
- Small‑gap analysis: Demonstrates that the new method attains lower regret than traditional UCB/TS in settings where reward differences are minuscule, a regime where classic bounds become vacuous.
- Hidden Markov Model (HMM) integration: Shows how modeling regime switches in financial data with an HMM further improves exploration‑exploitation and yields higher risk‑adjusted returns.
- Empirical validation: Provides simulation studies and a real‑world portfolio‑allocation experiment that illustrate both regret reduction and reliable coverage.
Methodology
- Base Bandit Policy – Start with any standard bandit algorithm (e.g., UCB, Thompson Sampling).
- Conformal Score Construction – After each round, compute a non‑conformity score for the observed reward (e.g., absolute residual between the reward and the policy’s predicted mean).
- Prediction Set Update – Using the collection of past scores, form a quantile‑based prediction interval that, by the conformal prediction guarantee, contains the next reward with probability at least 1 − α.
- Decision Rule – The bandit selects the arm with the highest conformal upper bound (or, more generally, the arm maximizing some utility of its interval), thereby balancing exploration (wide intervals) and exploitation (high predicted reward).
- HMM Extension – For non‑stationary environments (e.g., financial markets), an HMM is fitted online to capture latent regimes. The conformal scores are conditioned on the inferred hidden state, allowing the intervals to adapt to regime changes.
All steps are computationally lightweight: the conformal quantile can be updated in O(1) per round, and the HMM inference uses standard forward‑backward recursions.
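The four core steps above can be sketched as a short simulation loop. This is an illustrative CB‑UCB‑style sketch, not the paper's implementation: the Gaussian reward model, the gap Δ = 0.02, the noise level, the warm‑up phase, and the running‑mean predictor are all my assumptions for the sake of a runnable example.

```python
import numpy as np

# Minimal CB-UCB-style sketch on a small-gap bandit: per-arm
# absolute-residual scores define a conformal radius, and the arm
# with the highest upper bound (mean + radius) is pulled.
rng = np.random.default_rng(0)
K, T, alpha, sigma = 3, 2000, 0.05, 0.1
true_means = np.array([0.50, 0.50, 0.52])   # small gap: best arm barely better

means_hat = np.zeros(K)                     # running mean reward per arm
counts = np.zeros(K, dtype=int)
scores = [[] for _ in range(K)]             # non-conformity scores |r - mean|

def pull(arm):
    """Play an arm: score the reward against the current prediction, update."""
    reward = rng.normal(true_means[arm], sigma)
    if counts[arm] > 0:                     # need a prediction to score against
        scores[arm].append(abs(reward - means_hat[arm]))
    counts[arm] += 1
    means_hat[arm] += (reward - means_hat[arm]) / counts[arm]

for a in range(K):                          # warm-up: two pulls per arm,
    pull(a)                                 # so each arm has one score
    pull(a)

for _ in range(T - 2 * K):
    # Upper bound = predicted mean + (1 - alpha) quantile of past scores.
    radius = np.array([np.quantile(scores[a], 1 - alpha) for a in range(K)])
    pull(int(np.argmax(means_hat + radius)))
```

With enough rounds, the most-pulled arm's running mean converges to its true mean; the quantile here is recomputed from scratch each round for clarity, whereas an incremental quantile update would match the lightweight bookkeeping described above.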
Results & Findings
| Experiment | Baseline | Conformal Bandit (CB) | Regret Reduction | Coverage (Target 95 %) |
|---|---|---|---|---|
| Synthetic small‑gap bandit (Δ = 0.02) | UCB | CB‑UCB | ≈30 % lower cumulative regret | 94.8 % |
| Synthetic small‑gap bandit | TS | CB‑TS | ≈22 % lower regret | 95.3 % |
| Portfolio allocation (daily returns) | UCB‑Portfolio | CB‑UCB + HMM | ≈15 % higher Sharpe ratio (risk‑adjusted return) | 96.1 % |
Key takeaways
- In the small‑gap regime, classic UCB/TS suffer from high variance in arm selection, inflating regret. The conformal intervals act as a self‑calibrating confidence measure that curtails unnecessary exploration.
- Coverage stays close to the nominal level even with as few as 100 rounds, confirming the finite‑sample guarantee.
- Adding an HMM to model market regimes further sharpens performance, especially when the underlying reward distribution shifts abruptly.
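The finite-sample coverage behavior can be illustrated with a toy split-conformal check. This is a simplified, non-adaptive stand-in for the bandit setting (i.i.d. Gaussian rewards and a fixed point prediction of zero are my assumptions), using a calibration set of 100 scores to mirror the "as few as 100 rounds" observation:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_cal, n_test = 0.05, 100, 10_000

# Non-conformity score: absolute residual from a (here fixed) prediction of 0.
cal_scores = np.abs(rng.normal(0.0, 1.0, n_cal))

# Finite-sample-valid conformal quantile: the ceil((n+1)(1-alpha))-th
# order statistic of the calibration scores.
k = int(np.ceil((n_cal + 1) * (1 - alpha)))
q = np.sort(cal_scores)[min(k, n_cal) - 1]

# Empirical coverage on fresh draws should sit near the 95 % target.
test_rewards = rng.normal(0.0, 1.0, n_test)
coverage = float(np.mean(np.abs(test_rewards) <= q))
```

The guarantee is marginal (on average over calibration draws), so any single run fluctuates around the nominal level rather than hitting it exactly, which matches the 94.8–96.1 % figures in the table above.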
Practical Implications
- A/B testing & online experimentation: When the lift between variants is tiny, Conformal Bandits can decide faster while still providing a statistical guarantee that the observed lift is not a fluke.
- Ad‑tech bidding: Real‑time bidding often operates under minuscule CPM differences; the method can reduce wasted spend by reliably identifying the marginally better ad placement.
- Financial algorithmic trading: Portfolio managers can use the HMM‑augmented version to adapt to market regime changes, achieving better risk‑adjusted returns without sacrificing statistical confidence in predicted returns.
- Robotics & control: In safety‑critical settings where actions have almost identical expected costs, conformal intervals give an extra layer of assurance before committing to a control input.
Implementation is straightforward: wrap any existing bandit library with a conformal‑prediction wrapper, and optionally plug in an HMM module for non‑stationary data streams.
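One way such a wrapper could look, assuming a base policy that exposes `predict_mean(arm)` and `update(arm, reward)` (a hypothetical interface for illustration, not any specific library's API):

```python
from collections import defaultdict
import numpy as np

class ConformalWrapper:
    """Add absolute-residual conformal intervals on top of any bandit
    policy exposing predict_mean(arm) and update(arm, reward)."""
    def __init__(self, policy, alpha=0.05):
        self.policy, self.alpha = policy, alpha
        self.scores = defaultdict(list)          # per-arm non-conformity scores

    def interval(self, arm):
        mu = self.policy.predict_mean(arm)
        s = self.scores[arm]
        q = float(np.quantile(s, 1 - self.alpha)) if s else float("inf")
        return mu - q, mu + q                    # unplayed arms stay maximally wide

    def update(self, arm, reward):
        # Score against the pre-update prediction, then let the policy learn.
        self.scores[arm].append(abs(reward - self.policy.predict_mean(arm)))
        self.policy.update(arm, reward)

class RunningMeanPolicy:
    """Trivial base policy, used here only to exercise the wrapper."""
    def __init__(self, k):
        self.n, self.m = [0] * k, [0.0] * k
    def predict_mean(self, arm):
        return self.m[arm]
    def update(self, arm, reward):
        self.n[arm] += 1
        self.m[arm] += (reward - self.m[arm]) / self.n[arm]

cb = ConformalWrapper(RunningMeanPolicy(2), alpha=0.1)
for r in [1.0, 1.2, 0.8, 1.1]:
    cb.update(0, r)
lo, hi = cb.interval(0)                          # interval around arm 0's mean
```

An HMM module for non-stationary streams would slot in by keying `self.scores` on the inferred latent state as well as the arm, so that each regime maintains its own score pool.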
Limitations & Future Work
- Scalability to many arms: The current analysis focuses on a modest number of arms (≤10). Extending the framework to high‑dimensional action spaces (e.g., contextual bandits with large feature sets) may require more efficient quantile‑estimation techniques.
- Choice of non‑conformity score: The paper uses simple absolute residuals; more sophisticated scores could improve interval tightness but need careful calibration.
- Theoretical regret bounds: While empirical regret improvements are shown, a formal regret bound that incorporates the conformal coverage term remains an open problem.
- Robustness to model misspecification: The HMM assumes a finite number of regimes; real markets may exhibit more complex dynamics. Future work could explore non‑parametric regime‑switching models or deep‑learning‑based latent state estimators.
Authors
- Simone Cuonzo
- Nina Deliu
Paper Information
- arXiv ID: 2512.09850v1
- Categories: cs.LG
- Published: December 10, 2025