[Paper] GPU-Resident Gaussian Process Regression Leveraging Asynchronous Tasks with HPX
Source: arXiv - 2602.19683v1
Overview
The paper presents a GPU‑resident extension to GPRat, an open‑source library that already leverages HPX’s task‑based parallelism for Gaussian Process Regression (GPR). By moving the entire prediction pipeline onto the GPU and orchestrating work with asynchronous CUDA streams, the authors achieve substantial speed‑ups over the original CPU‑only version—making exact GPR feasible for larger training sets on commodity hardware.
Key Contributions
- Full GPU residency for GPR prediction: data never leaves the device, eliminating costly host‑device transfers.
- Tiled algorithm design that maps the Cholesky factorisation and triangular solves onto highly‑optimized CUDA kernels (cuBLAS, cuSOLVER).
- Integration of HPX with CUDA streams to overlap computation and data movement, effectively turning the CPU into a lightweight task scheduler for GPU work.
- Performance model & empirical tuning of the optimal number of concurrent CUDA streams for different dataset sizes.
- Open‑source implementation with a Python front‑end, enabling rapid prototyping for developers.
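To ground the pipeline these contributions refer to, here is the exact-GPR prediction math in plain NumPy — a CPU reference sketch of the same three linear-algebra steps (kernel assembly, Cholesky factorization, triangular solves) that the paper keeps resident on the GPU. The RBF kernel and all hyperparameters below are illustrative choices, not taken from the paper or from GPRat's API.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two point sets."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X_train, y_train, X_test, noise=1e-3):
    """Exact GP posterior mean: assemble K, factor it, solve twice."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    L = np.linalg.cholesky(K)                    # the POTRF step on the GPU
    # Forward then backward substitution: alpha = K^{-1} y
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    return rbf_kernel(X_test, X_train) @ alpha   # cross-covariance times alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, (64, 1))
y = np.sin(X[:, 0])
mean = gp_predict(X, y, X)   # predicting at the training inputs recovers y closely
```

On the GPU, each of these dense operations maps to a cuBLAS/cuSOLVER call; the point of full residency is that `K`, `L`, and `alpha` never travel back to the host between steps.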
Methodology
- Baseline Library – GPRat provides a task‑parallel CPU implementation of exact GPR built on HPX.
- GPU Porting Strategy – The authors rewrote the three core linear‑algebra steps of exact GPR (kernel matrix construction, Cholesky decomposition, and forward/backward substitution) as tiled operations that fit into GPU memory hierarchies.
- Asynchronous Execution – Each tile is dispatched to a separate CUDA stream. HPX schedules these streams as lightweight tasks, allowing multiple tiles to be processed concurrently while the CPU continues to manage other work.
- Library Stack – The implementation relies on cuBLAS for dense matrix‑matrix multiplies, cuSOLVER for Cholesky, and custom kernels for tile handling.
- Evaluation – Experiments compare (a) the CPU‑only GPRat, (b) the new GPU‑resident version with varying numbers of streams, and (c) a pure cuSOLVER baseline. Training‑set sizes range from 64 to 1024 points; wall‑clock time is measured for the Cholesky decomposition and for the full prediction.
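The tiled mapping can be illustrated with a blocked right‑looking Cholesky in plain NumPy — a sketch of the standard algorithm, not GPRat's implementation. Each per‑tile operation corresponds to one GPU kernel in the paper's scheme: the diagonal factorization to cuSOLVER's POTRF, the panel solve to a TRSM, and the trailing updates to SYRK/GEMM calls. Tiles within one step are mutually independent, which is what lets HPX dispatch them to separate streams.

```python
import numpy as np

def tiled_cholesky(A, tile=64):
    """Blocked right-looking Cholesky A = L L^T (returns lower-triangular L).

    Per-tile operations mirror the GPU kernels: diagonal factorization
    (POTRF), panel solve (TRSM), trailing-submatrix updates (SYRK/GEMM).
    """
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, tile):
        ke = min(k + tile, n)
        # POTRF: factor the diagonal tile
        A[k:ke, k:ke] = np.linalg.cholesky(A[k:ke, k:ke])
        Lkk = A[k:ke, k:ke]
        for i in range(ke, n, tile):
            ie = min(i + tile, n)
            # TRSM: L[i,k] = A[i,k] @ inv(Lkk)^T (triangular solve on GPU)
            A[i:ie, k:ke] = np.linalg.solve(Lkk, A[i:ie, k:ke].T).T
        for j in range(ke, n, tile):
            je = min(j + tile, n)
            for i in range(j, n, tile):
                ie = min(i + tile, n)
                # SYRK (i == j) / GEMM (i > j): independent tile updates
                A[i:ie, j:je] -= A[i:ie, k:ke] @ A[j:je, k:ke].T
    return np.tril(A)

rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)   # well-conditioned SPD test matrix
L = tiled_cholesky(A, tile=64)
```

The inner update loops touch disjoint blocks, so on the GPU each iteration can become an asynchronous kernel launch on its own stream.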
Results & Findings
| Dataset size (training points) | CPU‑only (s) | GPU (optimal streams) (s) | Speed‑up (Decomp.) | Speed‑up (Prediction) |
|---|---|---|---|---|
| 64 | 0.12 | 0.11 | 1.1× | 1.0× |
| 128 | 0.45 | 0.28 | 1.6× | 1.5× |
| 256 | 1.78 | 0.41 | 4.3× | 4.2× |
| 512 | 7.2 | 1.6 | 4.5× | 4.6× |
| 1024 | 28.5 | 6.2 | 4.6× | 4.6× |
- Cholesky decomposition speeds up by as much as 4.6×; the full GP prediction (including kernel evaluation) likewise reaches 4.6×.
- With multiple streams, the GPU implementation matches or exceeds cuSOLVER performance for large matrices, beating it by up to 11 % on the 1024‑point case.
- The optimal number of CUDA streams grows with dataset size (e.g., 2 streams for 128 points, 8–12 streams for 1024 points), confirming the importance of overlapping work.
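Why more streams pay off with larger problems: each factorization step produces a growing set of mutually independent trailing‑tile updates, and there must be enough of them to fill the available streams. As a hypothetical CPU‑side analogue of the HPX‑plus‑streams pattern (a thread pool standing in for CUDA streams; this is not the paper's code), the sketch below dispatches one task per tile:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def trailing_update(A, k, ke, tile, workers=4):
    """Run one Cholesky step's trailing-tile updates as concurrent tasks.

    Stand-in for the paper's scheme: each tile update would be a cuBLAS
    GEMM launched on its own CUDA stream, with HPX tracking completion.
    The tiles write disjoint blocks and only read the already-finished
    panel column, so they are safe to run concurrently -- the same
    independence that stream-count tuning exploits.
    """
    n = A.shape[0]
    tiles = [(i, min(i + tile, n), j, min(j + tile, n))
             for j in range(ke, n, tile)
             for i in range(j, n, tile)]          # lower triangle only
    def gemm(t):
        i, ie, j, je = t
        A[i:ie, j:je] -= A[i:ie, k:ke] @ A[j:je, k:ke].T
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(gemm, tiles))
    return len(tiles)

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8))
A = A @ A.T                       # symmetric demo matrix
before = A.copy()
n_tasks = trailing_update(A, k=0, ke=2, tile=2)   # 6 independent tile tasks
```

With a fixed tile size, the number of independent tasks per step grows quadratically with matrix size, which matches the observation that larger datasets can keep more streams busy.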
Practical Implications
- Scalable Exact GPR – Developers can run exact Gaussian Process inference on datasets that would otherwise push them toward sparse approximations or inducing‑point methods.
- Python‑first API – The unchanged Python front‑end means data scientists can drop in the GPU‑accelerated version without rewriting code.
- Hybrid CPU‑GPU Scheduling – HPX’s task model combined with CUDA streams offers a template for other ML workloads that need fine‑grained overlap (e.g., Bayesian optimisation loops, kernel ridge regression).
- Cost‑effective Deployment – Since the speed‑ups are achieved on a single GPU, cloud‑based inference services can reduce latency and compute cost, especially for real‑time prediction pipelines.
- Benchmark for GPU‑centric Linear Algebra – The tiled approach and stream tuning provide a reference for anyone optimizing dense factorizations on GPUs beyond the usual cuSOLVER defaults.
Limitations & Future Work
- Memory Bound – The approach still requires the full kernel matrix to reside on the GPU; extremely large training sets (>~10k points) will exceed typical GPU memory.
- Kernel Construction Overhead – The paper focuses on the linear‑algebra phase; building the kernel matrix on the GPU is not fully explored and could become a bottleneck for high‑dimensional data.
- Portability – The implementation is CUDA‑specific; extending to AMD GPUs or other accelerators would need additional work.
- Future Directions – The authors suggest integrating out‑of‑core strategies, exploring mixed‑precision Cholesky for further speed‑ups, and generalising the HPX‑CUDA stream orchestration to other Bayesian inference algorithms.
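Mixed‑precision Cholesky, one of the suggested directions, typically follows the classic iterative‑refinement recipe. The sketch below is a generic NumPy illustration of that recipe (not the authors' code, and the paper does not specify an approach): factor in float32, then recover float64 accuracy with a few residual‑correction steps.

```python
import numpy as np

def mixed_precision_solve(A, b, iters=3):
    """Solve A x = b for SPD A: Cholesky in float32, refine in float64.

    The cheap low-precision factor acts as an approximate inverse, and a
    few double-precision residual corrections restore full accuracy.
    """
    L = np.linalg.cholesky(A.astype(np.float32))   # fast low-precision factor
    def approx_solve(r):
        y = np.linalg.solve(L, r.astype(np.float32))
        return np.linalg.solve(L.T, y).astype(np.float64)
    x = approx_solve(b)
    for _ in range(iters):
        r = b - A @ x                              # residual in float64
        x = x + approx_solve(r)                    # refinement step
    return x

rng = np.random.default_rng(3)
M = rng.standard_normal((300, 300))
A = M @ M.T + 300 * np.eye(300)    # well-conditioned SPD system
b = rng.standard_normal(300)
x = mixed_precision_solve(A, b)
resid = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
```

On recent GPUs the float32 (or tensor-core) factorization is several times faster than float64, which is where the potential speed‑up comes from; convergence of the refinement depends on the conditioning of the kernel matrix.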
Authors
- Henrik Möllmann
- Dirk Pflüger
- Alexander Strack
Paper Information
- arXiv ID: 2602.19683v1
- Categories: cs.DC
- Published: February 23, 2026