nabla — Pure Rust GPU math engine: PyTorch-familiar API, zero C++ deps, 4 backends
Source: Dev.to
Introduction
I got tired of wiring cuBLAS through bindgen FFI and hand‑deriving gradients just to do GPU math in Rust, so I built nabla.
Features
- Linear algebra primitives: matrix multiplication (`a * &b`), solving linear systems (`a.solve(&b)?`), singular value decomposition (`a.svd()`), etc.
- Kernel fusion: `fuse!(x.sin().powf(2.0); x)` combines multiple operations into a single GPU kernel.
- Einstein summation: `einsum!(c[i,j] = a[i,k] * b[k,j])`
- Reverse-mode autodiff (PyTorch-style): `loss.backward(); w.grad();`
- Four mutually exclusive backends, chosen at build time: CPU, wgpu, CUDA, HIP.
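To make the einsum notation above concrete: in `einsum!(c[i,j] = a[i,k] * b[k,j])` the repeated index `k` is summed over, which is exactly matrix multiplication. A minimal sketch of those semantics in plain Rust (deliberately not using nabla, just spelling out what the macro computes):

```rust
// Semantics of einsum!(c[i,j] = a[i,k] * b[k,j]) spelled out as loops:
// the index k appears on the right-hand side only, so it is summed over,
// yielding the ordinary matrix product c = a * b.
fn einsum_matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (m, k_dim, n) = (a.len(), b.len(), b[0].len());
    let mut c = vec![vec![0.0; n]; m];
    for i in 0..m {
        for j in 0..n {
            for k in 0..k_dim {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let b = vec![vec![5.0, 6.0], vec![7.0, 8.0]];
    println!("{:?}", einsum_matmul(&a, &b)); // [[19.0, 22.0], [43.0, 50.0]]
}
```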
nabla is not a framework: there is no model zoo and no pretrained weights. Every mathematically fixed primitive (matmul, convolution, softmax, cross-entropy, …) is optimized for CPU/GPU, and you compose them yourself.
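For readers unfamiliar with reverse-mode autodiff: it records the forward computation, then propagates adjoints from the output back to each input, which is conceptually what `loss.backward()` followed by `w.grad()` exposes. A toy scalar version of that backward sweep, independent of nabla's actual implementation:

```rust
// Toy reverse-mode example for loss = (w*x - y)^2.
// The forward pass stores intermediates; the backward pass walks them in
// reverse, propagating d(loss)/d(node) adjoints from the output back to w.
fn loss_and_grad(w: f64, x: f64, y: f64) -> (f64, f64) {
    // Forward pass
    let p = w * x;     // p = w*x
    let r = p - y;     // residual
    let loss = r * r;
    // Backward pass (reverse order of the forward pass)
    let d_r = 2.0 * r; // d(loss)/dr
    let d_p = d_r;     // dr/dp = 1
    let d_w = d_p * x; // dp/dw = x
    (loss, d_w)
}

fn main() {
    let (loss, grad) = loss_and_grad(3.0, 2.0, 5.0);
    println!("loss = {loss}, dL/dw = {grad}"); // loss = 1, dL/dw = 4
}
```

A real tape-based autodiff generalizes this to arbitrary graphs of tensor ops, but the reverse sweep is the same idea.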
Backends
- CPU – pure Rust implementation, no external BLAS/LAPACK.
- wgpu – cross‑platform GPU backend (Vulkan, Metal, DX12, …).
- CUDA – NVIDIA GPUs.
- HIP – AMD GPUs.
Only one backend can be enabled per build.
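Build-time backend selection like this is usually expressed through Cargo features. Assuming nabla follows the common one-feature-per-backend convention (the feature names below are my guess, not taken from the crate docs), enabling the CUDA backend might look like:

```toml
# Hypothetical Cargo.toml snippet — "cuda" as a feature name is assumed,
# not verified against nabla's published feature list.
[dependencies]
nabla = { version = "*", default-features = false, features = ["cuda"] }
```

Disabling default features matters here because the backends are mutually exclusive: enabling two at once would be a build error.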
Benchmarks (GH200)
- Eager mode: nabla is 4–6× faster than PyTorch on MLP training.
- CUDA Graph: nabla wins when batch size ≥ 128.
- Matmul (4096 × 4096, TF32): 7.5× faster than PyTorch.
Reproducibility
`cd benchmarks && bash run.sh`
The benchmark scripts are deterministic and can be rerun to verify the results.
Tests
nabla is a pure‑Rust library with 293 tests and no C++ dependencies (no LAPACK, no BLAS).