[Paper] Optimism Stabilizes Thompson Sampling for Adaptive Inference

Published: February 5, 2026 at 01:52 PM EST
4 min read

Source: arXiv - 2602.06014v1

Overview

This paper tackles a subtle but important problem: when you use Thompson Sampling (TS) to explore and exploit in a multi‑armed bandit, the data you collect is adaptive—the number of pulls each arm receives depends on the rewards you’ve seen so far. That adaptivity can break the usual statistical guarantees (e.g., confidence intervals) that rely on fixed‑sample theory. The authors show that injecting a modest amount of optimism into TS restores the “stability” needed for reliable asymptotic inference, even when many arms are equally good.

Key Contributions

  • Stability proof for variance‑inflated TS in any K‑armed Gaussian bandit, extending prior results that covered only two arms.
  • Alternative optimistic TS that keeps posterior variance unchanged but adds a mean bonus, also proven to be stable.
  • Demonstration that both optimistic variants achieve asymptotically valid inference (e.g., confidence intervals) while incurring only a small regret penalty.
  • Formal connection between optimism (a classic exploration principle) and statistical stability under adaptive data collection.

Methodology

  1. Problem setting – The authors consider the standard stochastic K‑armed Gaussian bandit: each arm i yields i.i.d. rewards r_{i,t} ~ N(μ_i, σ²).
  2. Thompson Sampling baseline – At each round, TS draws a sample from the posterior of each arm’s mean and selects the arm with the highest sampled value.
  3. Instability issue – Because the number of pulls N_i(t) for each arm is random and coupled with the observed rewards, the classic Central Limit Theorem (CLT) for the sample mean may fail; valid inference requires the pull counts to “concentrate” around deterministic rates.
  4. Optimistic modifications
    • Variance‑inflated TS (from Halder et al. 2025): artificially enlarge the posterior variance by a factor >1 before sampling.
    • Mean‑bonus TS (new): add a deterministic optimism bonus β_t to the posterior mean while leaving the variance untouched.
  5. Stability analysis – Using martingale concentration, coupling arguments, and asymptotic‑normality tools, the authors prove that under either modification the pull counts N_i(t) satisfy

     N_i(t)/t → λ_i in probability, for some deterministic λ_i > 0,

     which is the stability condition required for valid inference.
  6. Regret evaluation – They bound the extra regret introduced by optimism, showing it grows only logarithmically with time, i.e., the cost is mild compared to the benefit of stable inference.
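The pipeline above can be sketched in a few lines. This is a minimal simulation, not the paper's implementation: the inflation factor and the decaying bonus schedule below are illustrative choices, not the paper's exact parameters.

```python
import numpy as np

def optimistic_ts(mu, T=5000, sigma=1.0, inflate=1.0, bonus_c=0.0, seed=0):
    """Thompson Sampling on Gaussian arms with optional optimism.

    inflate > 1 : variance-inflated TS (wider posterior std before sampling).
    bonus_c > 0 : mean-bonus TS (decaying bonus added to the posterior mean;
                  the schedule used here is illustrative, not the paper's).
    Returns the per-arm pull counts N_i(T).
    """
    rng = np.random.default_rng(seed)
    K = len(mu)
    sums, counts = np.zeros(K), np.zeros(K)
    for t in range(1, T + 1):
        if (counts == 0).any():                 # pull every arm once first
            arm = int(np.argmin(counts))
        else:
            post_mean = sums / counts
            post_mean = post_mean + bonus_c * sigma * np.sqrt(
                np.log(t + 1) / counts)         # optimism via a mean bonus
            post_std = inflate * sigma / np.sqrt(counts)  # via inflation
            arm = int(np.argmax(rng.normal(post_mean, post_std)))
        sums[arm] += rng.normal(mu[arm], sigma)
        counts[arm] += 1
    return counts

# Two tied optimal arms -- the hard case for vanilla TS.
counts = optimistic_ts(mu=[0.2, 0.5, 0.5], inflate=1.5)
print(counts / counts.sum())   # pull fractions N_i(T)/T
```

With optimism switched on, every arm keeps accumulating pulls, which is exactly the concentration behavior the stability proof formalizes.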

Results & Findings

| Variant | Stability (proved) | Regret overhead | Practical inference |
| --- | --- | --- | --- |
| Standard TS | ❌ No (fails when multiple arms are optimal) | — | Confidence intervals can be misleading |
| Variance‑inflated TS | ✅ for any K | O(log T) extra regret | Asymptotically correct confidence intervals |
| Mean‑bonus TS | ✅ for any K | O(log T) extra regret | Same inference guarantees, simpler implementation |

The key takeaway is that adding optimism—either by inflating variance or by a mean bonus—forces each arm to be pulled often enough for the CLT to kick in, even when the algorithm is aggressively exploiting the best arms.
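Concretely, the inference being protected is the ordinary plug‑in interval. A minimal sketch for the known‑variance Gaussian setting (the numbers in the example are made up for illustration):

```python
import numpy as np

def arm_ci(sum_rewards, n_pulls, sigma=1.0, z=1.96):
    """Plug-in asymptotic 95% CI for one arm's mean (known reward sd sigma).

    This interval is only trustworthy when N_i(t)/t concentrates -- the
    stability condition the optimistic TS variants are shown to guarantee.
    """
    mean = sum_rewards / n_pulls
    half = z * sigma / np.sqrt(n_pulls)
    return mean - half, mean + half

# Example: 500 adaptively allocated pulls with total reward 260 (mean 0.52).
lo, hi = arm_ci(sum_rewards=260.0, n_pulls=500)
```

Under vanilla TS with tied optimal arms, the same formula can under‑cover; optimism restores its asymptotic validity.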

Practical Implications

  • A/B testing & online experimentation – When you run multi‑variant tests that adaptively allocate traffic (e.g., bandit‑driven feature rollouts), using an optimistic TS variant lets you compute valid confidence intervals on conversion rates without resorting to costly fixed‑sample designs.
  • Reinforcement learning pipelines – Many RL systems use bandit‑style exploration for hyperparameter tuning or policy selection. Plugging an optimism bonus into the posterior mean preserves statistical guarantees for downstream performance estimates.
  • Production services – Implementing the mean‑bonus version is straightforward (just add a decaying bonus term to the sampled mean). The extra regret is negligible for typical traffic volumes, making it a low‑risk upgrade over vanilla TS.
  • Tooling – Libraries such as bandit, MABWiser, or custom Python/Go services can expose an “optimistic” flag that internally applies the variance inflation or mean bonus, giving developers a ready‑made statistically‑sound exploration strategy.
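As a sketch of what such an “optimistic” flag could look like in practice — note that `OptimisticTS`, the mode names, and the bonus schedule below are hypothetical, not any existing library's API:

```python
import numpy as np

class OptimisticTS:
    """Gaussian TS policy with an optimism switch (hypothetical API).

    mode="none":     vanilla Thompson Sampling
    mode="variance": variance-inflated TS
    mode="bonus":    mean-bonus TS (illustrative decaying schedule)
    """

    def __init__(self, n_arms, sigma=1.0, mode="bonus", inflate=1.5,
                 bonus_c=1.0, seed=None):
        self.sums = np.zeros(n_arms)
        self.counts = np.zeros(n_arms)
        self.sigma, self.mode = sigma, mode
        self.inflate, self.bonus_c = inflate, bonus_c
        self.t = 0
        self.rng = np.random.default_rng(seed)

    def select(self):
        """Pick an arm by sampling from (optionally optimistic) posteriors."""
        self.t += 1
        if (self.counts == 0).any():            # initialize every arm once
            return int(np.argmin(self.counts))
        mean = self.sums / self.counts
        std = self.sigma / np.sqrt(self.counts)
        if self.mode == "variance":
            std = std * self.inflate
        elif self.mode == "bonus":
            mean = mean + self.bonus_c * self.sigma * np.sqrt(
                np.log(self.t + 1) / self.counts)
        return int(np.argmax(self.rng.normal(mean, std)))

    def update(self, arm, reward):
        """Record the observed reward for the pulled arm."""
        self.sums[arm] += reward
        self.counts[arm] += 1

# Flipping one flag switches the exploration behavior:
policy = OptimisticTS(n_arms=3, mode="variance")
```

The service loop is the usual `arm = policy.select(); policy.update(arm, reward)`; only the constructor flag changes when moving from vanilla to optimistic TS.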

Limitations & Future Work

  • The analysis assumes Gaussian rewards with known variance; extending the stability proofs to bounded or heavy‑tailed reward distributions remains open.
  • The optimism parameters (inflation factor or bonus schedule) are theoretically motivated but may need empirical tuning for specific domains.
  • The work focuses on asymptotic inference; finite‑sample confidence interval calibration (e.g., via bootstrap) is not addressed.
  • Future research could explore contextual bandits, where the optimism mechanism must interact with high‑dimensional feature representations, and investigate whether similar stability guarantees hold.

Authors

  • Shunxing Yan
  • Han Zhong

Paper Information

  • arXiv ID: 2602.06014v1
  • Categories: cs.LG, cs.AI, math.OC, math.ST, stat.ML
  • Published: February 5, 2026