numr 0.5.0: The Rust numerical computing library that doesn't make you choose
Source: Dev.to
Foundational numerical computing for Rust
numr provides n‑dimensional tensors, linear algebra, FFT, statistics, and automatic differentiation, with native CPU, CUDA, and WebGPU backends.
It is “NumPy in Rust” but built with gradients, GPUs, and modern dtypes from day one.
What numr Is
- A foundation library – mathematical building blocks for higher‑level libraries and applications.
What numr Is Not
- Not just a tensor library (like NumPy’s ndarray).
- Not a deep‑learning framework.
- Not a high‑level ML API.
- Not a collection of domain‑specific tools.
Core Features
| Feature | Description |
|---|---|
| Tensor library | N‑dimensional tensors (like NumPy’s ndarray). |
| Linear algebra | Decompositions, solvers, etc. |
| FFT, statistics, random distributions | Comprehensive scientific‑computing primitives. |
| Automatic differentiation | Built‑in numr::autograd. |
| Native GPU support | CUDA + WebGPU backends, with autograd. |
| Cross‑platform GPU | Works on NVIDIA, AMD, Intel, Apple silicon (via WebGPU). |
| FP8 & quantized kernels | FP8 matmul, i8×i8→i32 matmul, 2:4 structured sparsity. |
| Fused kernels | GEMM + bias + activation, activation‑mul, add‑norm, etc. |
| CUDA‑specific improvements | Caching allocator, graph capture, GEMV fast paths, pipelined D2H copy. |
For SciPy‑equivalent functionality (optimization, ODE, interpolation, signal), see the companion crate [solvr].
Why numr? – Comparison with NumPy
| Capability | NumPy | numr |
|---|---|---|
| N‑dimensional tensors | ✓ | ✓ |
| Linear algebra, FFT, stats | ✓ | ✓ |
| Automatic differentiation | ✗ (needs JAX/PyTorch) | ✓ (built‑in numr::autograd) |
| GPU acceleration | ✗ (needs CuPy/JAX) | ✓ (native CUDA + WebGPU) |
| Non‑NVIDIA GPUs | ✗ | ✓ (AMD, Intel, Apple via WebGPU) |
| FP8 support | – | ✓ (E4M3 & E5M2) |
| 2:4 structured sparsity | – | ✓ (all backends) |
| Quantized matmul (i8×i8→i32) | – | ✓ (CPU) |
| Fused kernels (GEMM epilogue, activation‑mul, add‑norm) | – | ✓ (CPU, CUDA, WebGPU) |
| Comprehensive autograd (second‑order) | – | ✓ (conv, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, dtype cast, narrow, cat, gather, …) |
The Problem We Solved
- Fragmentation: Existing Rust crates each solve a single problem (e.g., ndarray for tensors, nalgebra for linear algebra, rustfft for FFT). None provide GPU support or autograd, and they use incompatible types and idioms.
- Developer burden: You end up writing adapter layers, filing upstream issues, and juggling multiple backends just to get a simple numerical pipeline running on GPUs.
numr removes that burden:
One library, one tensor type, one API – tensors, linalg, FFT, statistics, autograd, GPU.
Write your code once and run it on:
- CPU (AVX‑512, etc.)
- NVIDIA (native CUDA kernels)
- AMD / Intel / Apple silicon (via WebGPU)
Same code, same results.
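To make the "write once, run on any backend" idea concrete, here is a minimal sketch in plain Rust of how a trait can hide the backend from user code. The `Backend` trait, the two implementations, and `double_plus` are hypothetical names invented for this illustration; they are not numr's actual types.

```rust
/// Hypothetical backend abstraction (not numr's real trait).
trait Backend {
    fn name(&self) -> &'static str;
    /// Element-wise addition of two equally sized buffers.
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
}

/// Straightforward scalar loop.
struct ScalarCpu;

/// Walks the data in fixed-width chunks, standing in for a SIMD or GPU path.
struct ChunkedCpu;

impl Backend for ScalarCpu {
    fn name(&self) -> &'static str {
        "scalar-cpu"
    }
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        assert_eq!(a.len(), b.len());
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
}

impl Backend for ChunkedCpu {
    fn name(&self) -> &'static str {
        "chunked-cpu"
    }
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        assert_eq!(a.len(), b.len());
        let mut out = Vec::with_capacity(a.len());
        for (ca, cb) in a.chunks(4).zip(b.chunks(4)) {
            for (x, y) in ca.iter().zip(cb) {
                out.push(x + y);
            }
        }
        out
    }
}

/// User code is written once against the trait and never names a backend.
/// Computes 2*x + y element-wise.
fn double_plus<B: Backend>(backend: &B, x: &[f32], y: &[f32]) -> Vec<f32> {
    let doubled = backend.add(x, x);
    backend.add(&doubled, y)
}
```

Calling `double_plus(&ScalarCpu, &x, &y)` and `double_plus(&ChunkedCpu, &x, &y)` runs the same user function on both implementations; a real library would dispatch to CUDA or WebGPU kernels behind the same kind of interface.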
Release 0.5.0 Highlights
Performance‑critical fused kernels
| Kernel | What it does | Benefit |
|---|---|---|
| GEMM epilogue | matmul + bias + activation in a single launch | 2‑3× speed‑up for neural‑network inner loops (forward & backward) |
| Activation‑mul | Fused multiply for gated architectures (e.g., SwiGLU) | One read instead of three |
| Add‑norm | Residual connection + normalization fused | One read per transformer layer |
All kernels run on CPU, CUDA, and WebGPU, and each has a correct backward pass.
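As a rough illustration of what a GEMM epilogue fusion does, the sketch below compares an unfused and a fused matmul + bias + ReLU in plain Rust on row-major slices. It is a scalar reference for the dataflow only, not numr's kernel; the function names and the row-major layout are assumptions made for this example.

```rust
/// Unfused reference: the output buffer is traversed three times
/// (matmul, then bias add, then ReLU).
/// `a` is m x k, `b` is k x n, both row-major; `bias` has length n.
fn linear_relu_unfused(a: &[f32], b: &[f32], bias: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = acc;
        }
    }
    // Second pass: add the bias to every row.
    for i in 0..m {
        for j in 0..n {
            out[i * n + j] += bias[j];
        }
    }
    // Third pass: apply ReLU.
    for v in out.iter_mut() {
        *v = v.max(0.0);
    }
    out
}

/// Fused: bias and ReLU are applied while the accumulator is still local,
/// so each output element is written exactly once.
fn linear_relu_fused(a: &[f32], b: &[f32], bias: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = (acc + bias[j]).max(0.0);
        }
    }
    out
}
```

Both functions perform the same arithmetic in the same order, so they return identical results; the fused version simply avoids re-reading and re-writing the output, which is where the savings come from on memory-bound hardware.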
FP8 & Quantization
- FP8 matmul (E4M3 & E5M2) across all backends – crucial for fitting large models in VRAM.
- i8×i8→i32 quantized matmul on CPU – enables efficient inference without a GPU.
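For readers unfamiliar with these formats, here is a small plain-Rust sketch of the two ideas: decoding an FP8 E4M3 byte to `f32`, and an i8×i8→i32 matmul that accumulates in 32 bits. The E4M3 layout assumed here is the common variant with exponent bias 7, no infinities, and a single NaN mantissa pattern; the source does not say which variant numr implements, and the function names are made up for the example.

```rust
/// Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7) to f32.
/// Assumes the variant without infinities, where only S.1111.111 encodes NaN.
fn e4m3_to_f32(bits: u8) -> f32 {
    let sign = if bits & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 3) & 0x0F) as i32;
    let mant_bits = bits & 0x07;
    let mant = mant_bits as f32;
    if exp == 0x0F && mant_bits == 0x07 {
        return f32::NAN;
    }
    if exp == 0 {
        // Subnormal: no implicit leading 1, fixed scale of 2^-6.
        return sign * (mant / 8.0) * 2.0f32.powi(-6);
    }
    // Normal: implicit leading 1, exponent bias 7.
    sign * (1.0 + mant / 8.0) * 2.0f32.powi(exp - 7)
}

/// Quantized matmul: i8 inputs, i32 accumulation and output.
/// `a` is m x k, `b` is k x n, both row-major.
fn matmul_i8_i32(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    let mut out = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0i32;
            for p in 0..k {
                // Widen before multiplying so products cannot overflow i8.
                acc += a[i * k + p] as i32 * b[p * n + j] as i32;
            }
            out[i * n + j] = acc;
        }
    }
    out
}
```

The widening to `i32` is the point of the i8×i8→i32 signature: even a three-element dot product of extreme i8 values already exceeds the i16 range.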
Structured Sparsity
- 2:4 sparsity support on every backend.
- On CUDA it hits the hardware fast path; on CPU/WebGPU it uses optimized sparse kernels.
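The 2:4 pattern means that in every group of four consecutive weights, at most two are non-zero. The sketch below shows one plausible way to prune to that pattern and store the result compactly, in plain Rust. The magnitude-based pruning rule, the storage layout, and all function names are assumptions for illustration; the source does not describe numr's actual format.

```rust
/// Prune in place to 2:4 sparsity: in each group of four, keep the two
/// largest-magnitude values and zero the other two.
/// Panics if the length is not a multiple of 4 or if a value is NaN.
fn prune_2_4(weights: &mut [f32]) {
    assert!(weights.len() % 4 == 0, "length must be a multiple of 4");
    for group in weights.chunks_mut(4) {
        // Indices ordered by descending magnitude; the sort is stable,
        // so ties keep the earlier index.
        let mut idx = [0usize, 1, 2, 3];
        idx.sort_by(|&i, &j| group[j].abs().partial_cmp(&group[i].abs()).unwrap());
        group[idx[2]] = 0.0;
        group[idx[3]] = 0.0;
    }
}

/// Compressed form: exactly two value slots per group, plus each slot's
/// position (0..4) inside its group.
fn compress_2_4(pruned: &[f32]) -> (Vec<f32>, Vec<u8>) {
    let mut values = Vec::with_capacity(pruned.len() / 2);
    let mut positions = Vec::with_capacity(pruned.len() / 2);
    for group in pruned.chunks(4) {
        let mut kept = 0;
        for (pos, &v) in group.iter().enumerate() {
            if v != 0.0 && kept < 2 {
                values.push(v);
                positions.push(pos as u8);
                kept += 1;
            }
        }
        // Pad groups with fewer than two non-zeros so the layout stays regular.
        while kept < 2 {
            values.push(0.0);
            positions.push(0);
            kept += 1;
        }
    }
    (values, positions)
}

/// Dot product of a compressed 2:4 vector with a dense vector.
fn sparse_dot(values: &[f32], positions: &[u8], dense: &[f32]) -> f32 {
    values
        .iter()
        .zip(positions)
        .enumerate()
        .map(|(slot, (v, &p))| v * dense[(slot / 2) * 4 + p as usize])
        .sum()
}
```

The compressed form stores half the values plus a small index per value, and the dot product touches only the kept entries, which is the property that hardware fast paths and sparse CPU kernels exploit.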
Autograd Expansion
All of the following are now differentiable (including second‑order derivatives):
conv1d, conv2d, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, fused GEMM epilogue, fused add‑norm, dtype cast, narrow, cat, gather, …
Additional features:
- Activation checkpointing – trade compute for memory.
- Backward hooks – trigger distributed gradient sync during backprop.
This is not an ML framework; it is the autograd engine that frameworks can be built on.
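To show what an autograd engine does at its core, here is a minimal scalar reverse-mode tape in plain Rust. It is a teaching sketch under simplifying assumptions (scalars only, first-order gradients only, two operations); `Tape`, `Node`, and their methods are hypothetical and unrelated to the real `numr::autograd` API.

```rust
/// One recorded operation: two parent indices and the local partial
/// derivative of this node with respect to each parent.
struct Node {
    parents: [usize; 2],
    local_grads: [f64; 2],
}

/// Minimal scalar tape: nodes are appended in evaluation order.
struct Tape {
    nodes: Vec<Node>,
    values: Vec<f64>,
}

impl Tape {
    fn new() -> Self {
        Tape { nodes: Vec::new(), values: Vec::new() }
    }

    fn push(&mut self, value: f64, parents: [usize; 2], local_grads: [f64; 2]) -> usize {
        self.nodes.push(Node { parents, local_grads });
        self.values.push(value);
        self.nodes.len() - 1
    }

    /// Leaf variable: points at itself with zero local gradients.
    fn var(&mut self, value: f64) -> usize {
        let id = self.nodes.len();
        self.push(value, [id, id], [0.0, 0.0])
    }

    fn add(&mut self, a: usize, b: usize) -> usize {
        let v = self.values[a] + self.values[b];
        self.push(v, [a, b], [1.0, 1.0])
    }

    fn mul(&mut self, a: usize, b: usize) -> usize {
        let (va, vb) = (self.values[a], self.values[b]);
        self.push(va * vb, [a, b], [vb, va])
    }

    /// Reverse sweep: walk the tape backwards from `output`, pushing each
    /// node's gradient to its parents via the chain rule.
    fn backward(&self, output: usize) -> Vec<f64> {
        let mut grads = vec![0.0; self.nodes.len()];
        grads[output] = 1.0;
        for i in (0..=output).rev() {
            let g = grads[i];
            let node = &self.nodes[i];
            for k in 0..2 {
                grads[node.parents[k]] += node.local_grads[k] * g;
            }
        }
        grads
    }
}
```

For `z = x * y + x`, building the graph with `var`, `mul`, and `add` and then calling `backward(z)` yields `dz/dx = y + 1` and `dz/dy = x`. Second-order derivatives, as described above, additionally require the backward pass itself to be recorded as differentiable operations, which this sketch does not attempt.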
CUDA‑specific Improvements
- Caching allocator – reuses memory blocks on the Rust side, dramatically cutting allocation overhead.
- Graph capture – record a sequence of kernel launches once and replay with zero overhead (essential for high‑throughput inference).
- GEMV fast paths – specialized kernels for the common case where one matrix dimension is tiny (e.g., batch‑size 1 inference).
- Pipelined D2H copy – overlap GPU computation with host‑side data transfer.
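The caching-allocator idea can be sketched in a few lines of plain Rust. This toy version pools host `Vec<u8>` buffers instead of device memory, and the power-of-two bucketing is an assumption chosen for simplicity; `CachingPool` and its methods are hypothetical and do not reflect numr's implementation.

```rust
use std::collections::HashMap;

/// Toy caching pool: released buffers are kept in per-size free lists and
/// handed back on the next request of the same bucket size.
struct CachingPool {
    free: HashMap<usize, Vec<Vec<u8>>>,
    hits: usize,
    misses: usize,
}

impl CachingPool {
    fn new() -> Self {
        CachingPool { free: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Round requests up to the next power of two so that similar sizes
    /// share a bucket.
    fn bucket(size: usize) -> usize {
        size.max(1).next_power_of_two()
    }

    /// Return a cached buffer if one is available, otherwise allocate.
    /// Reused buffers are not cleared, as is typical for such allocators.
    fn alloc(&mut self, size: usize) -> Vec<u8> {
        let bucket = Self::bucket(size);
        if let Some(buf) = self.free.get_mut(&bucket).and_then(|list| list.pop()) {
            self.hits += 1;
            return buf;
        }
        self.misses += 1;
        vec![0u8; bucket]
    }

    /// Hand a buffer back to the pool instead of freeing it.
    fn release(&mut self, buf: Vec<u8>) {
        self.free.entry(buf.len()).or_default().push(buf);
    }
}
```

On a GPU the "miss" path is an expensive driver call, so serving repeated allocations of similar sizes from the free lists is what cuts the overhead in steady-state workloads.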
These upgrades move numr from an “interesting foundation” to a production‑ready library.
Getting Started
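    # Cargo.toml
    [dependencies]
    numr = "0.5"

    // src/main.rs
    // Rust has no `@` operator, so the matrix product is written as a method call.
    // The `matmul` name, the `Device` import path, and the borrowed arguments are
    // assumptions; check the numr docs for the exact API.
    use numr::{Tensor, Device, autograd::grad};

    fn main() {
        // Example: fused GEMM + bias + activation on the GPU
        let a = Tensor::randn([128, 256], Device::Cuda);
        let b = Tensor::randn([256, 512], Device::Cuda);
        let bias = Tensor::zeros([512], Device::Cuda);
        // Forward pass (fused)
        let y = (a.matmul(&b) + &bias).relu();
        // Backward pass: gradients of y with respect to a, b, and bias
        let grads = grad(&y, &[&a, &b, &bias]);
        println!("Gradients computed!");
    }

For a full tutorial, see the [GitHub repository].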
Vision
- Unified scientific‑computing stack for Rust, comparable to the Python ecosystem but without the fragmentation.
- Zero‑copy, zero‑overhead GPU pipelines that work on any modern GPU.
- Extensible foundation for higher‑level ML libraries, simulation tools, and more.
If you’re tired of stitching together incompatible crates, give numr a try. Write once, run everywhere.
Overview
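numr 0.5.0 is a foundational numerical‑computing library. It does not itself provide SciPy‑style functionality such as:
- Optimization routines
- ODE solvers
- Interpolation utilities
Those live in the companion crate solvr. numr serves as the foundation for other Rust crates: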
- solvr – builds and runs on numr 0.5.0, offering scientific‑computing features (optimization, ODE solvers, interpolation).
- boostr – an ML framework with attention, Mixture‑of‑Experts (MoE), and Mamba blocks, also built on numr 0.5.0.
Both downstream libraries support end‑to‑end LLM inference and embedding generation.
Key Benefits
- Fused kernels – eliminate unnecessary performance overhead.
- Full autograd coverage – enables differentiation through realistic computation graphs.
- CUDA infrastructure – ensures GPU workloads run efficiently.
- Cross‑platform consistency – the same code works on CPU, CUDA, and WebGPU back‑ends.
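As one more illustration of what fusing removes, the sketch below contrasts an unfused and a fused residual add + RMS normalization in plain Rust. It is a scalar model of the dataflow, not numr's kernel; the function names, the choice of RMS norm, and the `eps` parameter are assumptions for the example.

```rust
/// Unfused: materializes the residual sum as a temporary, re-reads it to
/// compute the mean square, then re-reads it again to scale.
fn add_then_rms_norm(x: &[f32], residual: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let summed: Vec<f32> = x.iter().zip(residual).map(|(a, b)| a + b).collect();
    let mean_sq = summed.iter().map(|v| v * v).sum::<f32>() / summed.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    summed.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// Fused: the sum of squares is accumulated in the same loop that computes
/// the residual sum, and the output buffer is scaled in place.
fn fused_add_rms_norm(x: &[f32], residual: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(x.len());
    let mut sum_sq = 0.0f32;
    for (a, b) in x.iter().zip(residual) {
        let s = a + b;
        sum_sq += s * s;
        out.push(s);
    }
    let inv_rms = 1.0 / (sum_sq / x.len() as f32 + eps).sqrt();
    for (o, w) in out.iter_mut().zip(weight) {
        *o = *o * inv_rms * w;
    }
    out
}
```

Both versions produce the same normalized output; the fused one reads the inputs once and needs no separate temporary, which matters most when the operation is limited by memory bandwidth rather than arithmetic.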
Release Highlights
| Version | Highlights |
|---|---|
| 0.5.0 | Unblocks new releases of solvr (scientific computing) and boostr (ML framework), both built on numr. |
| 0.6.0 | Focuses on hardening: cleaning up error handling, auditing API stability, and preparing for an eventual 1.0 release. |
| 0.7.0+ (roadmap) | Adds native AMD GPU support via ROCm. |
Dependencies
    [dependencies]
    numr = "0.5.0"

With GPU support

    # CUDA support
    numr = { version = "0.5.0", features = ["cuda"] }
    # WebGPU (wgpu) support
    numr = { version = "0.5.0", features = ["wgpu"] }

Project Links
- GitHub:
- crates.io:
License
numr is released under the Apache‑2.0 license. Contributions are welcome.