[Paper] Nested Slice Sampling: Vectorized Nested Sampling for GPU-Accelerated Inference
Source: arXiv - 2601.23252v1
Overview
The paper presents Nested Slice Sampling (NSS), a new way to run Nested Sampling on GPUs. By replacing the traditional, sequential “replace‑the‑worst” step with a vectorized Hit‑and‑Run slice sampler, the authors turn a notoriously hard‑to‑parallelize algorithm into something that can exploit thousands of GPU cores. The result is a fast, scalable inference engine that still delivers accurate Bayesian evidence estimates and high‑quality posterior samples—even on multimodal, high‑dimensional problems.
Key Contributions
- GPU‑friendly formulation of Nested Sampling that removes the sequential bottleneck.
- Hit‑and‑Run Slice Sampling as the constrained proposal mechanism, enabling fully vectorized updates.
- Simple, near‑optimal slice‑width rule derived from a thorough tuning analysis, making per‑iteration cost predictable.
- Open‑source implementation (Python/Numba + CUDA) that can be dropped into existing Bayesian workflows.
- Empirical validation on synthetic multimodal benchmarks, high‑dimensional Bayesian models, and Gaussian‑process hyper‑parameter marginalisation, showing competitive or superior evidence estimates compared to tempered SMC.
Methodology
Nested Sampling works by maintaining a set of “live points” that explore the prior while progressively discarding the lowest‑likelihood point and replacing it with a new point that satisfies a higher likelihood constraint. The classic approach draws the replacement sequentially, which is ill‑suited for GPUs.
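The classic loop can be sketched in a few lines of plain NumPy. This is a minimal, illustrative sketch, not the paper's implementation: the replacement step here naively re-draws from the prior until the likelihood constraint is met, which is exactly the sequential bottleneck NSS removes. All names and the toy termination rule are assumptions for illustration.

```python
import numpy as np

def nested_sampling(log_likelihood, prior_sample, n_live=100, n_iter=600, rng=None):
    """Minimal sequential Nested Sampling sketch (illustrative only).

    prior_sample(rng, n) draws n points from the prior. The inner while-loop
    is the sequential constrained replacement that NSS vectorizes away.
    """
    rng = np.random.default_rng() if rng is None else rng
    live = prior_sample(rng, n_live)                      # (n_live, dim) live points
    logL = np.array([log_likelihood(x) for x in live])
    logZ = -np.inf                                        # running log-evidence
    logX = 0.0                                            # log prior volume remaining
    for _ in range(n_iter):
        worst = int(np.argmin(logL))                      # lowest-likelihood live point
        logX_new = logX - 1.0 / n_live                    # expected volume shrinkage
        logw = np.log(np.exp(logX) - np.exp(logX_new))    # weight of the shed shell
        logZ = np.logaddexp(logZ, logw + logL[worst])     # accumulate evidence
        # Sequential constrained replacement: rejection-sample the prior
        # until a point beats the discarded likelihood threshold.
        threshold = logL[worst]
        while True:
            x = prior_sample(rng, 1)[0]
            fx = log_likelihood(x)
            if fx > threshold:
                break
        live[worst], logL[worst] = x, fx
        logX = logX_new
    # Fold in the remaining live points' contribution on termination.
    logZ = np.logaddexp(logZ, logX - np.log(n_live) + np.logaddexp.reduce(logL))
    return logZ
```

On a toy problem with a known answer (a standard 2-D Gaussian likelihood under a uniform prior on [-5, 5]², so log Z ≈ -log 100 ≈ -4.6), the sketch recovers the evidence to within its expected statistical error.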
NSS rethinks this step:
- Hit‑and‑Run Slice Sampling – Starting from each live point, a random direction is chosen, and a slice (interval) along that direction is defined by the current likelihood threshold. The algorithm then draws uniformly from that interval, shrinking it until the proposal satisfies the constraint.
- Vectorization – All live points are updated in parallel: each GPU thread handles one live point, performs the hit‑and‑run move, and checks the likelihood constraint.
- Slice‑width tuning – The authors derive a rule of thumb for the slice width that balances exploration and acceptance probability, especially important as dimensionality grows. This rule eliminates the need for expensive per‑iteration tuning.
The overall Nested Sampling loop (updating evidence, shrinking prior volume, etc.) stays unchanged; only the constrained sampling step becomes massively parallel.
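The constrained step above can be sketched with vectorized NumPy. This is a sketch under stated assumptions, not the paper's CUDA kernels: the fixed bracket width `w` is a hypothetical placeholder for the paper's tuned slice-width rule, and the per-point likelihood loop stands in for a batched GPU evaluation.

```python
import numpy as np

def hit_and_run_slice_step(points, logL, log_likelihood, threshold, w=1.0, rng=None):
    """One vectorized Hit-and-Run slice move applied to all live points at once.

    For each point: draw a random unit direction, place an interval of width w
    around the point along that direction, then shrink the interval toward the
    point until the proposal satisfies log_likelihood > threshold.
    `w` is a placeholder for the paper's tuned slice-width rule.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = points.shape
    dirs = rng.normal(size=(n, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions
    lo = -w * rng.uniform(size=n)                          # randomly placed bracket
    hi = lo + w                                            # containing offset 0
    new, newL = points.copy(), logL.copy()
    active = np.ones(n, dtype=bool)                        # points still proposing
    while active.any():
        t = rng.uniform(lo, hi)                            # proposal offsets, all points
        prop = points + t[:, None] * dirs
        propL = np.array([log_likelihood(x) for x in prop])  # batched on GPU in practice
        ok = active & (propL > threshold)
        new[ok], newL[ok] = prop[ok], propL[ok]
        # Shrink the bracket toward the start point (offset 0) on rejection.
        rej = active & ~ok
        lo = np.where(rej & (t < 0), t, lo)
        hi = np.where(rej & (t >= 0), t, hi)
        active = rej
    return new, newL
```

Because every live point starts inside its own slice (its likelihood already exceeds the threshold), the shrinkage loop is guaranteed to terminate, and each loop iteration evaluates the likelihood for the whole population at once — the property that maps naturally onto one GPU thread per live point.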
Results & Findings
| Experiment | Dimensionality | Evidence Error (Δlog Z) | Posterior Quality | Speedup vs. CPU |
|---|---|---|---|---|
| Multimodal Gaussian mixture | 10‑30 | ≤ 0.05 | Accurate mode weights | 12× (single GPU) |
| Bayesian logistic regression (real data) | 50 | 0.03 | Comparable to HMC | 8× |
| GP hyper‑parameter marginalisation | 20‑40 | ≤ 0.07 | Same predictive performance | 10× |
- Accuracy: Across all benchmarks, NSS matches or exceeds the evidence estimates of state‑of‑the‑art tempered Sequential Monte Carlo (SMC).
- Robustness: In highly multimodal settings where SMC sometimes collapses onto a single mode, NSS reliably discovers all modes thanks to the global Hit‑and‑Run moves.
- Predictable compute: The slice‑width rule yields a near‑constant number of likelihood evaluations per iteration, making GPU utilisation stable.
Practical Implications
- Faster Bayesian model comparison – Teams can now run Nested Sampling on large models (e.g., deep Bayesian nets, hierarchical GLMs) in minutes rather than hours, enabling rapid iteration on model design.
- Scalable uncertainty quantification – Engineers building safety‑critical systems (autonomous vehicles, aerospace) can afford to compute full Bayesian evidences for competing designs, improving risk assessment.
- GPU‑first pipelines – Because the implementation is pure Python/Numba with CUDA kernels, it plugs into existing PyTorch or JAX workflows without rewriting the model code.
- Better handling of multimodality – Applications like astrophysical parameter inference, mixture‑model clustering, or hyper‑parameter optimisation for non‑convex loss surfaces benefit from the algorithm’s ability to jump between distant modes efficiently.
Limitations & Future Work
- Memory footprint – Maintaining a large live‑point set on GPU memory can become a bottleneck for extremely high‑dimensional problems (> 200 D).
- Slice‑width heuristic – While near‑optimal for the tested cases, the rule may need adjustment for pathological priors (e.g., heavy‑tailed or highly constrained spaces).
- Limited to continuous priors – The current Hit‑and‑Run slice sampler assumes a continuous parameter space; discrete or combinatorial spaces would require a different constrained sampler.
- Future directions suggested by the authors include adaptive live‑point allocation, hybrid CPU‑GPU schemes for memory‑heavy models, and extending the framework to handle mixed continuous‑discrete parameter spaces.
Authors
- David Yallup
- Namu Kroupa
- Will Handley
Paper Information
- arXiv ID: 2601.23252v1
- Categories: stat.CO, cs.LG, stat.ML
- Published: January 30, 2026