nabla — Pure Rust GPU math engine: PyTorch-familiar API, zero C++ deps, 4 backends
Source: Dev.to
Introduction
I got tired of wiring cuBLAS through bindgen FFI and hand‑deriving gradients just to do GPU math in Rust, so I built nabla.
Features
- Linear algebra primitives: matrix multiplication (`a * &b`), solving linear systems (`a.solve(&b)?`), singular value decomposition (`a.svd()`), etc.
- Kernel fusion: `fuse!(x.sin().powf(2.0); x)` combines multiple operations into a single GPU kernel.
- Einstein summation: `einsum!(c[i,j] = a[i,k] * b[k,j])`
- Reverse-mode autodiff (PyTorch-style): `loss.backward(); w.grad();`
- Four mutually exclusive backends, chosen at build time: CPU, wgpu, CUDA, HIP.
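To make the einsum notation above concrete: in `einsum!(c[i,j] = a[i,k] * b[k,j])` the repeated index `k` is summed over, which is exactly matrix multiplication. A minimal sketch of those semantics in plain Rust (deliberately not using nabla, just spelling out what the macro computes):

```rust
// Semantics of einsum!(c[i,j] = a[i,k] * b[k,j]) spelled out as loops:
// the index k appears on the right-hand side only, so it is summed over,
// yielding the ordinary matrix product c = a * b.
fn einsum_matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (m, k_dim, n) = (a.len(), b.len(), b[0].len());
    let mut c = vec![vec![0.0; n]; m];
    for i in 0..m {
        for j in 0..n {
            for k in 0..k_dim {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let b = vec![vec![5.0, 6.0], vec![7.0, 8.0]];
    println!("{:?}", einsum_matmul(&a, &b)); // [[19.0, 22.0], [43.0, 50.0]]
}
```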
nabla is not a framework: there is no model zoo and no pretrained weights. Every mathematically fixed primitive (matmul, convolution, softmax, cross-entropy, …) is optimized for CPU/GPU, and you compose them yourself.
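For readers unfamiliar with reverse-mode autodiff: it records the forward computation, then propagates adjoints from the output back to each input, which is conceptually what `loss.backward()` followed by `w.grad()` exposes. A toy scalar version of that backward sweep, independent of nabla's actual implementation:

```rust
// Toy reverse-mode example for loss = (w*x - y)^2.
// The forward pass stores intermediates; the backward pass walks them in
// reverse, propagating d(loss)/d(node) adjoints from the output back to w.
fn loss_and_grad(w: f64, x: f64, y: f64) -> (f64, f64) {
    // Forward pass
    let p = w * x;     // p = w*x
    let r = p - y;     // residual
    let loss = r * r;
    // Backward pass (reverse order of the forward pass)
    let d_r = 2.0 * r; // d(loss)/dr
    let d_p = d_r;     // dr/dp = 1
    let d_w = d_p * x; // dp/dw = x
    (loss, d_w)
}

fn main() {
    let (loss, grad) = loss_and_grad(3.0, 2.0, 5.0);
    println!("loss = {loss}, dL/dw = {grad}"); // loss = 1, dL/dw = 4
}
```

A real tape-based autodiff generalizes this to arbitrary graphs of tensor ops, but the reverse sweep is the same idea.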
Backends
- CPU – pure Rust implementation, no external BLAS/LAPACK.
- wgpu – cross‑platform GPU backend (Vulkan, Metal, DX12, …).
- CUDA – NVIDIA GPUs.
- HIP – AMD GPUs.
Only one backend can be enabled per build.
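Build-time backend selection like this is usually expressed through Cargo features. Assuming nabla follows the common one-feature-per-backend convention (the feature names below are my guess, not taken from the crate docs), enabling the CUDA backend might look like:

```toml
# Hypothetical Cargo.toml snippet — "cuda" as a feature name is assumed,
# not verified against nabla's published feature list.
[dependencies]
nabla = { version = "*", default-features = false, features = ["cuda"] }
```

Disabling default features matters here because the backends are mutually exclusive: enabling two at once would be a build error.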
Benchmarks (GH200)
- Eager mode: nabla is 4–6× faster than PyTorch on MLP training.
- CUDA Graph: nabla wins when batch size ≥ 128.
- Matmul (4096 × 4096, TF32): 7.5× faster than PyTorch.
Reproducibility
`cd benchmarks && bash run.sh`
The benchmark scripts are deterministic and can be rerun to verify the results.
Tests
nabla is a pure‑Rust library with 293 tests and no C++ dependencies (no LAPACK, no BLAS).