[Paper] Optimism Stabilizes Thompson Sampling for Adaptive Inference

Published: February 5, 2026 at 01:52 PM EST
4 min read

Source: arXiv - 2602.06014v1

Overview

This paper tackles a subtle but important problem: when you use Thompson Sampling (TS) to explore and exploit in a multi‑armed bandit, the data you collect is adaptive—the number of pulls each arm receives depends on the rewards you’ve seen so far. That adaptivity can break the usual statistical guarantees (e.g., confidence intervals) that rely on fixed‑sample theory. The authors show that injecting a modest amount of optimism into TS restores the “stability” needed for reliable asymptotic inference, even when many arms are equally good.

Key Contributions

  • Stability proof for variance‑inflated TS in any K‑armed Gaussian bandit, extending prior results that covered only two arms.
  • Alternative optimistic TS that keeps posterior variance unchanged but adds a mean bonus, also proven to be stable.
  • Demonstration that both optimistic variants achieve asymptotically valid inference (e.g., confidence intervals) while incurring only a small regret penalty.
  • Formal connection between optimism (a classic exploration principle) and statistical stability under adaptive data collection.

Methodology

  1. Problem setting – The authors consider the standard stochastic K‑armed Gaussian bandit: each arm i yields i.i.d. rewards r_{i,t} ~ N(μ_i, σ²).
  2. Thompson Sampling baseline – At each round, TS draws a sample from the posterior of each arm’s mean and selects the arm with the highest sampled value.
  3. Instability issue – Because the number of pulls N_i(t) for each arm is random and coupled with the observed rewards, the classic Central Limit Theorem (CLT) for the sample mean may fail; valid inference requires the pull counts to “concentrate” around deterministic rates.
  4. Optimistic modifications
    • Variance‑inflated TS (from Halder et al. 2025): artificially enlarge the posterior variance by a factor >1 before sampling.
    • Mean‑bonus TS (new): add a deterministic optimism bonus β_t to the posterior mean while leaving the variance untouched.
  5. Stability analysis – Using martingale concentration, coupling arguments, and asymptotic‑normality tools, the authors prove that under either modification the pull counts N_i(t) satisfy

     N_i(t)/t → λ_i in probability, for some deterministic λ_i > 0,

     which is the stability condition required for valid inference.
  6. Regret evaluation – They bound the extra regret introduced by optimism, showing it grows only logarithmically with time, i.e., the cost is mild compared to the benefit of stable inference.
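The pipeline above can be sketched in a few lines. This is a minimal simulation, not the paper's implementation: the inflation factor and the decaying bonus schedule below are illustrative choices, not the paper's exact parameters.

```python
import numpy as np

def optimistic_ts(mu, T=5000, sigma=1.0, inflate=1.0, bonus_c=0.0, seed=0):
    """Thompson Sampling on Gaussian arms with optional optimism.

    inflate > 1 : variance-inflated TS (wider posterior std before sampling).
    bonus_c > 0 : mean-bonus TS (decaying bonus added to the posterior mean;
                  the schedule used here is illustrative, not the paper's).
    Returns the per-arm pull counts N_i(T).
    """
    rng = np.random.default_rng(seed)
    K = len(mu)
    sums, counts = np.zeros(K), np.zeros(K)
    for t in range(1, T + 1):
        if (counts == 0).any():                 # pull every arm once first
            arm = int(np.argmin(counts))
        else:
            post_mean = sums / counts
            post_mean = post_mean + bonus_c * sigma * np.sqrt(
                np.log(t + 1) / counts)         # optimism via a mean bonus
            post_std = inflate * sigma / np.sqrt(counts)  # via inflation
            arm = int(np.argmax(rng.normal(post_mean, post_std)))
        sums[arm] += rng.normal(mu[arm], sigma)
        counts[arm] += 1
    return counts

# Two tied optimal arms -- the hard case for vanilla TS.
counts = optimistic_ts(mu=[0.2, 0.5, 0.5], inflate=1.5)
print(counts / counts.sum())   # pull fractions N_i(T)/T
```

With optimism switched on, every arm keeps accumulating pulls, which is exactly the concentration behavior the stability proof formalizes.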

Results & Findings

| Variant | Stability (proved) | Regret overhead | Practical inference |
| --- | --- | --- | --- |
| Standard TS | ❌ No (fails when multiple arms are optimal) | — | Confidence intervals can be misleading |
| Variance‑inflated TS | ✅ for any K | O(log T) extra regret | Asymptotically correct confidence intervals |
| Mean‑bonus TS | ✅ for any K | O(log T) extra regret | Same inference guarantees, simpler implementation |

The key takeaway is that adding optimism—either by inflating variance or by a mean bonus—forces each arm to be pulled often enough for the CLT to kick in, even when the algorithm is aggressively exploiting the best arms.
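Concretely, the inference being protected is the ordinary plug‑in interval. A minimal sketch for the known‑variance Gaussian setting (the numbers in the example are made up for illustration):

```python
import numpy as np

def arm_ci(sum_rewards, n_pulls, sigma=1.0, z=1.96):
    """Plug-in asymptotic 95% CI for one arm's mean (known reward sd sigma).

    This interval is only trustworthy when N_i(t)/t concentrates -- the
    stability condition the optimistic TS variants are shown to guarantee.
    """
    mean = sum_rewards / n_pulls
    half = z * sigma / np.sqrt(n_pulls)
    return mean - half, mean + half

# Example: 500 adaptively allocated pulls with total reward 260 (mean 0.52).
lo, hi = arm_ci(sum_rewards=260.0, n_pulls=500)
```

Under vanilla TS with tied optimal arms, the same formula can under‑cover; optimism restores its asymptotic validity.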

Practical Implications

  • A/B testing & online experimentation – When you run multi‑variant tests that adaptively allocate traffic (e.g., bandit‑driven feature rollouts), using an optimistic TS variant lets you compute valid confidence intervals on conversion rates without resorting to costly fixed‑sample designs.
  • Reinforcement learning pipelines – Many RL systems use bandit‑style exploration for hyperparameter tuning or policy selection. Plugging an optimism bonus into the posterior mean preserves statistical guarantees for downstream performance estimates.
  • Production services – Implementing the mean‑bonus version is straightforward (just add a decaying bonus term to the sampled mean). The extra regret is negligible for typical traffic volumes, making it a low‑risk upgrade over vanilla TS.
  • Tooling – Libraries such as bandit, MABWiser, or custom Python/Go services can expose an “optimistic” flag that internally applies the variance inflation or mean bonus, giving developers a ready‑made statistically‑sound exploration strategy.
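As a sketch of what such an “optimistic” flag could look like in practice — note that `OptimisticTS`, the mode names, and the bonus schedule below are hypothetical, not any existing library's API:

```python
import numpy as np

class OptimisticTS:
    """Gaussian TS policy with an optimism switch (hypothetical API).

    mode="none":     vanilla Thompson Sampling
    mode="variance": variance-inflated TS
    mode="bonus":    mean-bonus TS (illustrative decaying schedule)
    """

    def __init__(self, n_arms, sigma=1.0, mode="bonus", inflate=1.5,
                 bonus_c=1.0, seed=None):
        self.sums = np.zeros(n_arms)
        self.counts = np.zeros(n_arms)
        self.sigma, self.mode = sigma, mode
        self.inflate, self.bonus_c = inflate, bonus_c
        self.t = 0
        self.rng = np.random.default_rng(seed)

    def select(self):
        """Pick an arm by sampling from (optionally optimistic) posteriors."""
        self.t += 1
        if (self.counts == 0).any():            # initialize every arm once
            return int(np.argmin(self.counts))
        mean = self.sums / self.counts
        std = self.sigma / np.sqrt(self.counts)
        if self.mode == "variance":
            std = std * self.inflate
        elif self.mode == "bonus":
            mean = mean + self.bonus_c * self.sigma * np.sqrt(
                np.log(self.t + 1) / self.counts)
        return int(np.argmax(self.rng.normal(mean, std)))

    def update(self, arm, reward):
        """Record the observed reward for the pulled arm."""
        self.sums[arm] += reward
        self.counts[arm] += 1

# Flipping one flag switches the exploration behavior:
policy = OptimisticTS(n_arms=3, mode="variance")
```

The service loop is the usual `arm = policy.select(); policy.update(arm, reward)`; only the constructor flag changes when moving from vanilla to optimistic TS.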

Limitations & Future Work

  • The analysis assumes Gaussian rewards with known variance; extending the stability proofs to bounded or heavy‑tailed reward distributions remains open.
  • The optimism parameters (inflation factor or bonus schedule) are theoretically motivated but may need empirical tuning for specific domains.
  • The work focuses on asymptotic inference; finite‑sample confidence interval calibration (e.g., via bootstrap) is not addressed.
  • Future research could explore contextual bandits, where the optimism mechanism must interact with high‑dimensional feature representations, and investigate whether similar stability guarantees hold.

Authors

  • Shunxing Yan
  • Han Zhong

Paper Information

  • arXiv ID: 2602.06014v1
  • Categories: cs.LG, cs.AI, math.OC, math.ST, stat.ML
  • Published: February 5, 2026