[Paper] Multi-Armed Sequential Hypothesis Testing by Betting
Source: arXiv - 2603.17925v1
Overview
The paper “Multi‑Armed Sequential Hypothesis Testing by Betting” tackles a practical problem that shows up in A/B‑testing, clinical trials, and online experimentation: you have several data streams (or “arms”) and you must decide, on the fly, which one to sample while testing whether any of them deviates from a global null hypothesis. By framing the problem as a betting game, the authors derive tests that are provably optimal—they perform as well as if you already knew which arm would give the strongest evidence.
Key Contributions
- Multi‑armed betting framework: Extends classic sequential testing by betting to settings with multiple selectable arms.
- Oracle‑level optimality: Introduces a formal notion of “log‑optimality” and “expected rejection‑time optimality” that guarantees performance matching an oracle that knows the best arm in advance.
- Matching lower and upper bounds: Proves that the proposed tests achieve the best possible trade‑off between type‑I error control and detection speed, for any number of arms.
- UCB‑style arm selection algorithm: Designs a modified Upper‑Confidence‑Bound (UCB) algorithm that works even when the reward (evidence) is not directly observable but can be reliably estimated.
- Non‑asymptotic concentration for Kelly wealth: Derives new concentration inequalities for the growth rate of betting wealth under optimal (Kelly) strategies, which may be useful beyond hypothesis testing.
Methodology
- Betting‑based e‑processes: The authors treat evidence against the null as capital in a betting game. At each round they place a bet on the next observation, and the accumulated capital forms an e‑process, a nonnegative process whose expectation never exceeds its initial value under the null. By Ville's inequality, the wealth crosses 1/α with probability at most α under the null, so rejecting at that threshold controls the type‑I error at level α.
- Arm selection via estimated rewards: Because the true “reward” (how much an arm would increase the betting wealth) is hidden, they construct unbiased estimators and feed them into a UCB‑type rule. The rule balances exploration (trying less‑sampled arms) and exploitation (focusing on arms that appear to generate large evidence).
- Optimal wealth growth (Kelly criterion): For each arm, the betting fraction is chosen to maximize the expected logarithmic growth of wealth, mirroring Kelly’s optimal gambling strategy.
- Theoretical analysis: By coupling the UCB selection with the Kelly betting fractions, they prove that the resulting e‑process grows at the same rate as if an oracle had always picked the best arm. This yields both lower bounds (no test can beat this rate) and matching upper bounds (their algorithm attains it).
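The wealth‑update mechanics described above can be sketched for a single [0, 1]‑valued arm (testing the null that the mean equals `mu0`). This is a minimal illustration, not the paper's exact construction: the plug‑in "Kelly" fraction `(mean_hat - mu0) / 0.25` and the clipping range are illustrative choices.

```python
import random

def kelly_bet_test(stream, mu0=0.5, alpha=0.05, max_rounds=10_000):
    """Sequential test by betting for a single [0, 1]-valued arm.

    Each round, wealth is multiplied by 1 + lam * (x - mu0). Under the
    null (mean mu0) the wealth is a nonnegative martingale, so by
    Ville's inequality P(sup wealth >= 1/alpha) <= alpha; rejecting
    when wealth crosses 1/alpha controls the type-I error at alpha.
    """
    wealth = 1.0
    running_sum, n = 0.0, 0
    for t in range(max_rounds):
        x = stream()
        # Plug-in Kelly-style fraction: bet proportionally to the gap
        # between the running mean and the null mean (0.25 is the
        # Bernoulli(1/2) variance), clipped so the payoff factor
        # 1 + lam * (x - mu0) stays strictly positive.
        mean_hat = running_sum / n if n else mu0
        lam = max(min((mean_hat - mu0) / 0.25, 1.5), -1.5)
        wealth *= 1.0 + lam * (x - mu0)
        running_sum += x
        n += 1
        if wealth >= 1.0 / alpha:
            return t + 1   # rejected the null at round t + 1
    return None            # never rejected

random.seed(0)
biased = lambda: 1.0 if random.random() < 0.8 else 0.0
stop = kelly_bet_test(biased)  # a clearly non-null arm: rejects quickly
```

Note the self‑correcting behavior: a stream that truly has mean `mu0` makes the estimated gap shrink toward zero, so the bets shrink and the wealth hovers near 1 rather than growing.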
Results & Findings
- Optimal rejection time: The expected number of samples needed to reject the global null is within a constant factor of the oracle benchmark, regardless of how many arms are present.
- Robustness to multiple non‑null arms: Even when several arms are truly non‑null, the test automatically concentrates on the most informative one, without prior knowledge.
- Empirical validation: Simulations on synthetic Bernoulli and Gaussian arms confirm that the proposed method reaches the theoretical rejection‑time bounds and outperforms naïve strategies (e.g., uniform sampling or fixed‑allocation designs).
- Concentration guarantees: The new non‑asymptotic bounds show that the betting wealth does not deviate dramatically from its expected exponential growth, providing strong finite‑sample guarantees.
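As a toy illustration of the uniform‑versus‑adaptive gap (not a reproduction of the paper's experiments), the sketch below runs a betting test over three Bernoulli arms twice: once sampling arms round‑robin, and once with a simple explore‑then‑greedy rule that favors the arm whose estimated mean deviates most from the null. The arm means, the 30‑round exploration budget, and the greedy rule are all assumptions of this sketch, standing in for the paper's UCB‑style selection.

```python
import random

def run_test(select_arm, p=(0.5, 0.5, 0.8), mu0=0.5, alpha=0.05, max_rounds=20_000):
    """Betting test over len(p) Bernoulli arms.

    select_arm(t, means, counts) picks which arm to sample next; a
    single shared wealth process accumulates evidence. Returns the
    rejection round, or None if the null is never rejected.
    """
    K = len(p)
    wealth, sums, counts = 1.0, [0.0] * K, [0] * K
    for t in range(max_rounds):
        means = [sums[k] / counts[k] if counts[k] else mu0 for k in range(K)]
        k = select_arm(t, means, counts)
        x = 1.0 if random.random() < p[k] else 0.0
        lam = max(min((means[k] - mu0) / 0.25, 1.5), -1.5)  # clipped Kelly-style bet
        wealth *= 1.0 + lam * (x - mu0)
        sums[k] += x
        counts[k] += 1
        if wealth >= 1.0 / alpha:
            return t + 1
    return None

uniform = lambda t, means, counts: t % len(means)   # round-robin sampling
greedy = lambda t, means, counts: (                 # explore 30 rounds, then exploit
    t % len(means) if t < 30
    else max(range(len(means)), key=lambda k: abs(means[k] - 0.5)))

random.seed(1)
t_uniform = run_test(uniform)
random.seed(1)
t_greedy = run_test(greedy)
```

Across repeated runs, the adaptive rule typically rejects markedly sooner because it stops wasting samples (and small betting losses) on the two null arms, which is the qualitative effect the paper's simulations quantify.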
Practical Implications
- A/B/n testing platforms: Engineers can embed the algorithm to continuously allocate traffic to the most promising variant while maintaining strict type‑I error control, reducing the time to detect a winning version.
- Adaptive clinical trials: Researchers can test multiple dosage levels or treatment arms simultaneously, stopping early as soon as any dose shows efficacy, without inflating false‑positive risk.
- Online monitoring & anomaly detection: Systems that monitor many metrics (e.g., latency across services) can treat each metric as an arm and quickly flag any that deviate from normal behavior.
- Resource‑efficient experimentation: Because the method automatically focuses on the strongest signal, it saves data collection costs and speeds up decision‑making in data‑limited environments.
Limitations & Future Work
- Assumption of estimable rewards: The theoretical guarantees rely on being able to construct unbiased, low‑variance estimators of each arm’s betting reward; in highly noisy or delayed feedback settings this may be challenging.
- Scalability to very large arm sets: While the UCB‑style rule is computationally light, the analysis does not address potential issues when the number of arms grows to thousands or more (e.g., in recommendation systems).
- Extension to non‑i.i.d. data: The current framework assumes independent observations per arm; handling temporal dependence or covariate‑shift would broaden applicability.
- Empirical benchmarks on real‑world data: Future work could evaluate the method on large‑scale A/B testing logs or clinical trial datasets to demonstrate practical gains over existing industry tools.
Bottom line: By marrying betting‑based sequential testing with a clever arm‑selection strategy, the authors deliver a theoretically optimal, easy‑to‑implement tool for any setting where you need to “pick a winner” among multiple data streams while keeping false alarms under control.
Authors
- Ricardo J. Sandoval
- Ian Waudby‑Smith
- Michael I. Jordan
Paper Information
- arXiv ID: 2603.17925v1
- Categories: stat.ME, cs.LG, math.ST
- Published: March 18, 2026