[Paper] Nested Slice Sampling: Vectorized Nested Sampling for GPU-Accelerated Inference
Source: arXiv - 2601.23252v1
Overview
The paper presents Nested Slice Sampling (NSS), a new way to run Nested Sampling on GPUs. By replacing the traditional, sequential “replace‑the‑worst” step with a vectorized Hit‑and‑Run slice sampler, the authors turn a notoriously hard‑to‑parallelize algorithm into something that can exploit thousands of GPU cores. The result is a fast, scalable inference engine that still delivers accurate Bayesian evidence estimates and high‑quality posterior samples—even on multimodal, high‑dimensional problems.
Key Contributions
- GPU‑friendly formulation of Nested Sampling that removes the sequential bottleneck.
- Hit‑and‑Run Slice Sampling as the constrained proposal mechanism, enabling fully vectorized updates.
- Simple, near‑optimal slice‑width rule derived from a thorough tuning analysis, making per‑iteration cost predictable.
- Open‑source implementation (Python/Numba + CUDA) that can be dropped into existing Bayesian workflows.
- Empirical validation on synthetic multimodal benchmarks, high‑dimensional Bayesian models, and Gaussian‑process hyper‑parameter marginalisation, showing competitive or superior evidence estimates compared to tempered SMC.
Methodology
Nested Sampling works by maintaining a set of “live points” that explore the prior while progressively discarding the lowest‑likelihood point and replacing it with a new point that satisfies a higher likelihood constraint. The classic approach draws the replacement sequentially, which is ill‑suited for GPUs.
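The classic loop can be sketched in a few lines of plain NumPy. This is a minimal, illustrative sketch, not the paper's implementation: the replacement step here naively re-draws from the prior until the likelihood constraint is met, which is exactly the sequential bottleneck NSS removes. All names and the toy termination rule are assumptions for illustration.

```python
import numpy as np

def nested_sampling(log_likelihood, prior_sample, n_live=100, n_iter=600, rng=None):
    """Minimal sequential Nested Sampling sketch (illustrative only).

    prior_sample(rng, n) draws n points from the prior. The inner while-loop
    is the sequential constrained replacement that NSS vectorizes away.
    """
    rng = np.random.default_rng() if rng is None else rng
    live = prior_sample(rng, n_live)                      # (n_live, dim) live points
    logL = np.array([log_likelihood(x) for x in live])
    logZ = -np.inf                                        # running log-evidence
    logX = 0.0                                            # log prior volume remaining
    for _ in range(n_iter):
        worst = int(np.argmin(logL))                      # lowest-likelihood live point
        logX_new = logX - 1.0 / n_live                    # expected volume shrinkage
        logw = np.log(np.exp(logX) - np.exp(logX_new))    # weight of the shed shell
        logZ = np.logaddexp(logZ, logw + logL[worst])     # accumulate evidence
        # Sequential constrained replacement: rejection-sample the prior
        # until a point beats the discarded likelihood threshold.
        threshold = logL[worst]
        while True:
            x = prior_sample(rng, 1)[0]
            fx = log_likelihood(x)
            if fx > threshold:
                break
        live[worst], logL[worst] = x, fx
        logX = logX_new
    # Fold in the remaining live points' contribution on termination.
    logZ = np.logaddexp(logZ, logX - np.log(n_live) + np.logaddexp.reduce(logL))
    return logZ
```

On a toy problem with a known answer (a standard 2-D Gaussian likelihood under a uniform prior on [-5, 5]², so log Z ≈ -log 100 ≈ -4.6), the sketch recovers the evidence to within its expected statistical error.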
NSS rethinks this step:
- Hit‑and‑Run Slice Sampling – Starting from each live point, a random direction is chosen, and a slice (interval) along that direction is defined by the current likelihood threshold. The algorithm then draws uniformly from that interval, shrinking it until the proposal satisfies the constraint.
- Vectorization – All live points are updated in parallel: each GPU thread handles one live point, performs the hit‑and‑run move, and checks the likelihood constraint.
- Slice‑width tuning – The authors derive a rule of thumb for the slice width that balances exploration and acceptance probability, especially important as dimensionality grows. This rule eliminates the need for expensive per‑iteration tuning.
The overall Nested Sampling loop (updating evidence, shrinking prior volume, etc.) stays unchanged; only the constrained sampling step becomes massively parallel.
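The constrained step above can be sketched with vectorized NumPy. This is a sketch under stated assumptions, not the paper's CUDA kernels: the fixed bracket width `w` is a hypothetical placeholder for the paper's tuned slice-width rule, and the per-point likelihood loop stands in for a batched GPU evaluation.

```python
import numpy as np

def hit_and_run_slice_step(points, logL, log_likelihood, threshold, w=1.0, rng=None):
    """One vectorized Hit-and-Run slice move applied to all live points at once.

    For each point: draw a random unit direction, place an interval of width w
    around the point along that direction, then shrink the interval toward the
    point until the proposal satisfies log_likelihood > threshold.
    `w` is a placeholder for the paper's tuned slice-width rule.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = points.shape
    dirs = rng.normal(size=(n, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions
    lo = -w * rng.uniform(size=n)                          # randomly placed bracket
    hi = lo + w                                            # containing offset 0
    new, newL = points.copy(), logL.copy()
    active = np.ones(n, dtype=bool)                        # points still proposing
    while active.any():
        t = rng.uniform(lo, hi)                            # proposal offsets, all points
        prop = points + t[:, None] * dirs
        propL = np.array([log_likelihood(x) for x in prop])  # batched on GPU in practice
        ok = active & (propL > threshold)
        new[ok], newL[ok] = prop[ok], propL[ok]
        # Shrink the bracket toward the start point (offset 0) on rejection.
        rej = active & ~ok
        lo = np.where(rej & (t < 0), t, lo)
        hi = np.where(rej & (t >= 0), t, hi)
        active = rej
    return new, newL
```

Because every live point starts inside its own slice (its likelihood already exceeds the threshold), the shrinkage loop is guaranteed to terminate, and each loop iteration evaluates the likelihood for the whole population at once — the property that maps naturally onto one GPU thread per live point.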
Results & Findings
| Experiment | Dimensionality | Evidence Error (Δlog Z) | Posterior Quality | Speedup vs. CPU |
|---|---|---|---|---|
| Multimodal Gaussian mixture | 10‑30 | ≤ 0.05 | Accurate mode weights | 12× (single GPU) |
| Bayesian logistic regression (real data) | 50 | 0.03 | Comparable to HMC | 8× |
| GP hyper‑parameter marginalisation | 20‑40 | ≤ 0.07 | Same predictive performance | 10× |
- Accuracy: Across all benchmarks, NSS matches or exceeds the evidence estimates of state‑of‑the‑art tempered Sequential Monte Carlo (SMC).
- Robustness: In highly multimodal settings where SMC sometimes collapses onto a single mode, NSS reliably discovers all modes thanks to the global Hit‑and‑Run moves.
- Predictable compute: The slice‑width rule yields a near‑constant number of likelihood evaluations per iteration, making GPU utilisation stable.
Practical Implications
- Faster Bayesian model comparison – Teams can now run Nested Sampling on large models (e.g., deep Bayesian nets, hierarchical GLMs) in minutes rather than hours, enabling rapid iteration on model design.
- Scalable uncertainty quantification – Engineers building safety‑critical systems (autonomous vehicles, aerospace) can afford to compute full Bayesian evidences for competing designs, improving risk assessment.
- GPU‑first pipelines – Because the implementation is pure Python/Numba with CUDA kernels, it plugs into existing PyTorch or JAX workflows without rewriting the model code.
- Better handling of multimodality – Applications like astrophysical parameter inference, mixture‑model clustering, or hyper‑parameter optimisation for non‑convex loss surfaces benefit from the algorithm’s ability to jump between distant modes efficiently.
Limitations & Future Work
- Memory footprint – Maintaining a large live‑point set on GPU memory can become a bottleneck for extremely high‑dimensional problems (> 200 D).
- Slice‑width heuristic – While near‑optimal for the tested cases, the rule may need adjustment for pathological priors (e.g., heavy‑tailed or highly constrained spaces).
- Limited to continuous priors – The current Hit‑and‑Run slice sampler assumes a continuous parameter space; discrete or combinatorial spaces would require a different constrained sampler.
- Future directions suggested by the authors include adaptive live‑point allocation, hybrid CPU‑GPU schemes for memory‑heavy models, and extending the framework to handle mixed continuous‑discrete parameter spaces.
Authors
- David Yallup
- Namu Kroupa
- Will Handley
Paper Information
- arXiv ID: 2601.23252v1
- Categories: stat.CO, cs.LG, stat.ML
- Published: January 30, 2026