nabla — Pure Rust GPU math engine: PyTorch-familiar API, zero C++ deps, 4 backends

Published: March 1, 2026 at 08:36 PM EST
2 min read
Source: Dev.to

Introduction

I got tired of wiring cuBLAS through bindgen FFI and hand‑deriving gradients just to do GPU math in Rust, so I built nabla.

Features

  • Linear algebra primitives: matrix multiplication (a * &b), solving linear systems (a.solve(&b)?), singular value decomposition (a.svd()), etc.

  • Kernel fusion:

    fuse!(x.sin().powf(2.0); x)

    Multiple operations are combined into a single GPU kernel.

  • Einstein summation:

    einsum!(c[i,j] = a[i,k] * b[k,j])

  • Reverse‑mode autodiff (PyTorch‑style):

    loss.backward();
    w.grad();

  • Four mutually exclusive backends (chosen at build time): CPU, wgpu, CUDA, HIP.
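For readers unfamiliar with the einsum notation above: in `c[i,j] = a[i,k] * b[k,j]` the repeated index `k` is summed over, which makes this particular contraction ordinary matrix multiplication. A minimal plain-Rust sketch of those semantics (illustrative only, not nabla's implementation):

```rust
// Plain-Rust illustration of the contraction c[i,j] = a[i,k] * b[k,j]:
// the repeated index k is summed over, i.e. matrix multiplication.
fn einsum_ik_kj(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (m, k, n) = (a.len(), b.len(), b[0].len());
    let mut c = vec![vec![0.0; n]; m];
    for i in 0..m {
        for j in 0..n {
            for kk in 0..k {
                c[i][j] += a[i][kk] * b[kk][j];
            }
        }
    }
    c
}

fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let b = vec![vec![5.0, 6.0], vec![7.0, 8.0]];
    println!("{:?}", einsum_ik_kj(&a, &b)); // [[19.0, 22.0], [43.0, 50.0]]
}
```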

nabla is not a framework: there is no model zoo and no pretrained weights. Every mathematically fixed primitive (matmul, convolution, softmax, cross‑entropy, …) is optimized for CPU/GPU, and you compose them yourself.
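To make the `loss.backward()` / `w.grad()` pattern concrete, here is a scalar concept sketch of reverse‑mode autodiff (not nabla's tensor tape, which is far more general): record the forward computation, then propagate derivatives backwards through it by the chain rule.

```rust
// Scalar reverse-mode autodiff sketch (illustrative only).
// Computes loss = (w * x - y)^2 and its gradient d(loss)/dw.
fn loss_and_grad(w: f64, x: f64, y: f64) -> (f64, f64) {
    // Forward pass: record intermediate values.
    let pred = w * x;
    let diff = pred - y;
    let loss = diff * diff;

    // Backward pass: walk the graph in reverse, applying the chain rule.
    let d_loss = 1.0;                  // d(loss)/d(loss)
    let d_diff = 2.0 * diff * d_loss;  // d(loss)/d(diff) = 2 * diff
    let d_pred = d_diff;               // d(diff)/d(pred) = 1
    let d_w = d_pred * x;              // d(pred)/d(w)    = x
    (loss, d_w)
}

fn main() {
    let (loss, grad) = loss_and_grad(3.0, 2.0, 5.0);
    println!("loss = {loss}, dloss/dw = {grad}"); // loss = 1, dloss/dw = 4
}
```

In a real tape-based system the backward pass is replayed automatically over every recorded op, which is exactly what `loss.backward()` hides.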

Backends

  • CPU – pure Rust implementation, no external BLAS/LAPACK.
  • wgpu – cross‑platform GPU backend (Vulkan, Metal, DX12, …).
  • CUDA – NVIDIA GPUs.
  • HIP – AMD GPUs.

Only one backend can be enabled per build.
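In Rust, build-time selection like this is typically expressed through Cargo features. The sketch below is hypothetical: the feature names are assumptions for illustration and are not taken from nabla's documentation.

```toml
# Hypothetical Cargo.toml sketch -- feature names are illustrative only.
# Enabling exactly one backend feature selects the implementation at build time.
[dependencies]
nabla = { version = "*", default-features = false, features = ["cuda"] }
```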

Benchmarks (GH200)

  • Eager mode: nabla is 4–6× faster than PyTorch on MLP training.
  • CUDA Graph: nabla is faster than PyTorch once batch size ≥ 128.
  • Matmul (4096 × 4096, TF32): 7.5× faster than PyTorch.

Reproducibility

cd benchmarks && bash run.sh

The benchmark scripts are deterministic and can be rerun to verify the results.

Tests

nabla is a pure‑Rust library with 293 tests and no C++ dependencies (no LAPACK, no BLAS).
