numr 0.5.0: The Rust numerical computing library that doesn't make you choose

Published: March 14, 2026 at 04:15 PM EDT

Source: Dev.to

Foundational numerical computing for Rust

numr provides n‑dimensional tensors, linear algebra, FFT, statistics, and automatic differentiation, with a CPU backend and native GPU acceleration through CUDA and WebGPU backends.
It is “NumPy in Rust” but built with gradients, GPUs, and modern dtypes from day one.

What numr Is

A foundation library – mathematical building blocks for higher‑level libraries and applications.

What numr Is Not

Not just a tensor library (like NumPy’s ndarray).
Not a deep‑learning framework.
Not a high‑level ML API.
Not a collection of domain‑specific tools.

Core Features

Feature	Description
Tensor library	N‑dimensional tensors (like NumPy’s `ndarray`).
Linear algebra	Decompositions, solvers, etc.
FFT, statistics, random distributions	Comprehensive scientific‑computing primitives.
Automatic differentiation	Built‑in `numr::autograd`.
Native GPU support	CUDA + WebGPU backends, with autograd.
Cross‑platform GPU	Works on NVIDIA, AMD, Intel, Apple silicon (via WebGPU).
FP8 & quantized kernels	FP8 matmul, i8×i8→i32 matmul, 2:4 structured sparsity.
Fused kernels	GEMM + bias + activation, activation‑mul, add‑norm, etc.
CUDA‑specific improvements	Caching allocator, graph capture, GEMV fast paths, pipelined D2H copy.

For SciPy‑equivalent functionality (optimization, ODE, interpolation, signal), see the companion crate [solvr].

Why numr? – Comparison with NumPy

Capability	NumPy	numr
N‑dimensional tensors	✓	✓
Linear algebra, FFT, stats	✓	✓
Automatic differentiation	✗ (needs JAX/PyTorch)	✓ (built‑in `numr::autograd`)
GPU acceleration	✗ (needs CuPy/JAX)	✓ (native CUDA + WebGPU)
Non‑NVIDIA GPUs	✗	✓ (AMD, Intel, Apple via WebGPU)
FP8 support	–	✓ (E4M3 & E5M2)
2:4 structured sparsity	–	✓ (all backends)
Quantized matmul (i8×i8→i32)	–	✓ (CPU)
Fused kernels (GEMM epilogue, activation‑mul, add‑norm)	–	✓ (CPU, CUDA, WebGPU)
Comprehensive autograd (second‑order)	–	✓ (conv, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, dtype cast, narrow, cat, gather, …)

The Problem We Solved

Fragmentation: Existing Rust crates each solve a single problem (e.g., ndarray for tensors, nalgebra for linear algebra, rustfft for FFT). None provide GPU support or autograd, and they use incompatible types and idioms.
Developer burden: You end up writing adapter layers, filing upstream issues, and juggling multiple backends just to get a simple numerical pipeline running on GPUs.

numr removes that burden:

One library, one tensor type, one API – tensors, linalg, FFT, statistics, autograd, GPU.

Write your code once and run it on:

CPU (AVX‑512, etc.)
NVIDIA (native CUDA kernels)
AMD / Intel / Apple silicon (via WebGPU)

Same code, same results.

Release 0.5.0 Highlights

Performance‑critical fused kernels

Kernel	What it does	Benefit
GEMM epilogue	`matmul + bias + activation` in a single launch	2‑3× speed‑up for neural‑network inner loops (forward & backward)
Activation‑mul	Fused multiply for gated architectures (e.g., SwiGLU)	One read instead of three
Add‑norm	Residual connection + normalization fused	One read per transformer layer

All kernels run on CPU, CUDA, and WebGPU, and each has a correct backward pass.
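
To make the GEMM epilogue idea concrete, here is a minimal CPU sketch in plain Rust. It is not numr's kernel code: the function names, the row-major `Vec<f32>` layout, and the choice of ReLU as the activation are assumptions made for illustration. The unfused version walks the output buffer three times; the fused version applies bias and activation while the accumulator is still in a local variable.

```rust
/// Unfused reference: matmul, bias add, and ReLU as three separate passes.
/// `a` is m×k and `b` is k×n, both row-major; `bias` has length n.
/// (Hypothetical helper, not part of numr.)
fn matmul_bias_relu_unfused(
    a: &[f32], b: &[f32], bias: &[f32],
    m: usize, k: usize, n: usize,
) -> Vec<f32> {
    // Pass 1: plain matmul writes every output element.
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            for j in 0..n {
                out[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    // Pass 2: re-read and re-write the output to add the bias.
    for i in 0..m {
        for j in 0..n {
            out[i * n + j] += bias[j];
        }
    }
    // Pass 3: re-read and re-write once more for the activation.
    for v in out.iter_mut() {
        *v = (*v).max(0.0);
    }
    out
}

/// Fused version: bias and ReLU run in the epilogue of the dot product,
/// so each output element is written to memory exactly once.
fn matmul_bias_relu_fused(
    a: &[f32], b: &[f32], bias: &[f32],
    m: usize, k: usize, n: usize,
) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = (acc + bias[j]).max(0.0);
        }
    }
    out
}
```

Both functions return the same values; the difference is memory traffic, which is where a fused kernel would be expected to gain on a GPU.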

FP8 & Quantization

FP8 matmul (E4M3 & E5M2) across all backends – crucial for fitting large models in VRAM.
i8×i8→i32 quantized matmul on CPU – enables efficient inference without a GPU.
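
For readers unfamiliar with these formats, the sketch below decodes the two FP8 layouts and shows why an i8 matmul accumulates into i32. It assumes the OCP FP8 conventions (E4M3 with bias 7 and no infinities, E5M2 with bias 15 and IEEE-style infinities); whether numr follows exactly these conventions is not stated in the post. All function names are hypothetical.

```rust
/// Decode an FP8 E4M3 byte: 1 sign, 4 exponent, 3 mantissa bits, bias 7.
/// Assumed convention: no infinities, NaN only when exponent and mantissa
/// bits are all ones.
fn decode_e4m3(bits: u8) -> f32 {
    let sign = if bits & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 3) & 0x0F) as i32;
    let man_bits = bits & 0x07;
    let man = man_bits as f32;
    if exp == 0x0F && man_bits == 0x07 {
        return f32::NAN;
    }
    if exp == 0 {
        // Subnormal: no implicit leading one.
        sign * (man / 8.0) * 2.0f32.powi(-6)
    } else {
        sign * (1.0 + man / 8.0) * 2.0f32.powi(exp - 7)
    }
}

/// Decode an FP8 E5M2 byte: 1 sign, 5 exponent, 2 mantissa bits, bias 15.
/// Assumed convention: exponent 31 encodes infinity (mantissa 0) or NaN.
fn decode_e5m2(bits: u8) -> f32 {
    let sign = if bits & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 2) & 0x1F) as i32;
    let man_bits = bits & 0x03;
    let man = man_bits as f32;
    if exp == 0x1F {
        return if man_bits == 0 { sign * f32::INFINITY } else { f32::NAN };
    }
    if exp == 0 {
        sign * (man / 4.0) * 2.0f32.powi(-14)
    } else {
        sign * (1.0 + man / 4.0) * 2.0f32.powi(exp - 15)
    }
}

/// i8×i8→i32 matmul: inputs stay 8-bit, but every product is widened and
/// summed in i32 so the accumulator cannot overflow the input type.
/// `a` is m×k and `b` is k×n, both row-major.
fn matmul_i8_i32(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    let mut out = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0i32;
            for p in 0..k {
                acc += a[i * k + p] as i32 * b[p * n + j] as i32;
            }
            out[i * n + j] = acc;
        }
    }
    out
}
```

Under these conventions E4M3 trades range for precision (largest finite value 448) while E5M2 does the opposite (largest finite value 57344).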

Structured Sparsity

2:4 sparsity support on every backend.
On CUDA it hits the hardware fast path; on CPU/WebGPU it uses optimized sparse kernels.
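
The 2:4 pattern itself is simple to state: in every group of four consecutive weights, at most two are non-zero. The sketch below is a plain-Rust illustration of magnitude-based pruning into that pattern, plus the compressed form (kept values and their in-group positions). The function names and the storage layout are assumptions for illustration, not numr's internal format.

```rust
/// Prune a weight row to the 2:4 pattern: in each group of four consecutive
/// weights, keep the two with the largest magnitude and drop the rest.
/// Returns the kept values (half the input length) and, for each kept value,
/// its position 0..=3 inside its group. (Hypothetical helper.)
fn prune_2_4(weights: &[f32]) -> (Vec<f32>, Vec<u8>) {
    assert!(weights.len() % 4 == 0, "length must be a multiple of 4");
    let mut values = Vec::with_capacity(weights.len() / 2);
    let mut indices = Vec::with_capacity(weights.len() / 2);
    for group in weights.chunks_exact(4) {
        // Order the four positions by descending magnitude. The sort is
        // stable, so ties keep the lower position first.
        let mut order = [0usize, 1, 2, 3];
        order.sort_by(|&x, &y| group[y].abs().total_cmp(&group[x].abs()));
        // Keep the top two, restored to their original left-to-right order.
        let mut kept = [order[0], order[1]];
        kept.sort();
        for &idx in &kept {
            values.push(group[idx]);
            indices.push(idx as u8);
        }
    }
    (values, indices)
}

/// Expand the compressed form back into a dense row with explicit zeros.
fn expand_2_4(values: &[f32], indices: &[u8]) -> Vec<f32> {
    let mut dense = vec![0.0f32; values.len() * 2];
    let pairs = values.chunks_exact(2).zip(indices.chunks_exact(2));
    for (group, (vals, idxs)) in pairs.enumerate() {
        for (v, &i) in vals.iter().zip(idxs) {
            dense[group * 4 + i as usize] = *v;
        }
    }
    dense
}
```

The compressed form stores half the values plus two bits of position per kept value, which is what lets a sparse kernel skip half the multiplications.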

Autograd Expansion

All of the following are now differentiable (including second‑order derivatives):

conv1d, conv2d, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, fused GEMM epilogue, fused add‑norm, dtype cast, narrow, cat, gather, …
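
As a reminder of what "second-order" means in practice, the sketch below computes a value, its first derivative, and its second derivative in one evaluation. Note the swapped-in technique: it uses nested dual numbers (forward mode), which is not how a backward-pass engine works and is not numr's implementation; it is chosen only because it fits in a few lines. The `Dual` type and the example function are hypothetical.

```rust
use std::ops::{Add, Mul};

/// A dual number `val + eps·ε` with ε² = 0. The `eps` slot carries a
/// derivative alongside the value.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Dual<T> {
    val: T,
    eps: T,
}

impl<T: Add<Output = T>> Add for Dual<T> {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        Dual { val: self.val + rhs.val, eps: self.eps + rhs.eps }
    }
}

impl<T: Copy + Add<Output = T> + Mul<Output = T>> Mul for Dual<T> {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        // Product rule: (a + a'ε)(b + b'ε) = ab + (a·b' + a'·b)ε
        Dual {
            val: self.val * rhs.val,
            eps: self.val * rhs.eps + self.eps * rhs.val,
        }
    }
}

/// Example function f(x) = x³ + x², written once for any numeric-like type.
fn f<T: Copy + Add<Output = T> + Mul<Output = T>>(x: T) -> T {
    x * x * x + x * x
}

/// Evaluate f on a dual number whose components are themselves dual numbers.
/// Seeding both levels with 1 makes the innermost `eps.eps` slot hold f''(x).
fn value_and_two_derivatives(x: f64) -> (f64, f64, f64) {
    let seeded = Dual {
        val: Dual { val: x, eps: 1.0 },
        eps: Dual { val: 1.0, eps: 0.0 },
    };
    let out = f(seeded);
    (out.val.val, out.val.eps, out.eps.eps)
}
```

For f(x) = x³ + x², the analytic derivatives are f'(x) = 3x² + 2x and f''(x) = 6x + 2, which is what the nested evaluation returns.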

Additional features:

Activation checkpointing – trade compute for memory.
Backward hooks – trigger distributed gradient sync during backprop.

This is not an ML framework; it is the autograd engine that frameworks can be built on.
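
To show what "autograd engine" means at its smallest, here is a scalar reverse-mode tape in plain Rust. It is a teaching sketch under stated assumptions, not numr's design: the `Tape` and `Node` types, the method names, and the restriction to scalars and three operations are all hypothetical.

```rust
/// One recorded operation: up to two parents and the local partial
/// derivative of this node with respect to each parent.
struct Node {
    parents: [usize; 2],
    local_grads: [f64; 2],
}

/// A minimal scalar tape. `values[i]` is the forward value of node `i`.
struct Tape {
    nodes: Vec<Node>,
    values: Vec<f64>,
}

impl Tape {
    fn new() -> Self {
        Tape { nodes: Vec::new(), values: Vec::new() }
    }

    fn push(&mut self, value: f64, parents: [usize; 2], local_grads: [f64; 2]) -> usize {
        self.nodes.push(Node { parents, local_grads });
        self.values.push(value);
        self.nodes.len() - 1
    }

    /// Input variable: it points at itself with zero local gradients,
    /// so the backward pass adds nothing through it.
    fn leaf(&mut self, value: f64) -> usize {
        let id = self.nodes.len();
        self.push(value, [id, id], [0.0, 0.0])
    }

    fn add(&mut self, a: usize, b: usize) -> usize {
        let v = self.values[a] + self.values[b];
        self.push(v, [a, b], [1.0, 1.0])
    }

    fn mul(&mut self, a: usize, b: usize) -> usize {
        let (va, vb) = (self.values[a], self.values[b]);
        self.push(va * vb, [a, b], [vb, va])
    }

    fn relu(&mut self, a: usize) -> usize {
        let va = self.values[a];
        let slope = if va > 0.0 { 1.0 } else { 0.0 };
        self.push(va.max(0.0), [a, a], [slope, 0.0])
    }

    /// Backward pass: walk the tape in reverse and apply the chain rule.
    /// Returns d(output)/d(node) for every node on the tape.
    fn backward(&self, output: usize) -> Vec<f64> {
        let mut grads = vec![0.0; self.nodes.len()];
        grads[output] = 1.0;
        for i in (0..=output).rev() {
            let node = &self.nodes[i];
            let upstream = grads[i];
            for k in 0..2 {
                grads[node.parents[k]] += node.local_grads[k] * upstream;
            }
        }
        grads
    }
}
```

Recording `z = relu(x * y + x)` on this tape and calling `backward(z)` yields the gradients for `x` and `y` in a single reverse sweep. Tensor-valued operations, checkpointing, and hooks would all be layered on top of a loop of this shape.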

CUDA‑specific Improvements

Caching allocator – reuses memory blocks on the Rust side, dramatically cutting allocation overhead.
Graph capture – record a sequence of kernel launches once and replay it as a single graph launch, removing the per‑kernel launch overhead (essential for high‑throughput inference).
GEMV fast paths – specialized kernels for the common case where one matrix dimension is tiny (e.g., batch‑size 1 inference).
Pipelined D2H copy – overlap GPU computation with host‑side data transfer.

These upgrades move numr from an “interesting foundation” to a production‑ready library.
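
The caching-allocator idea can be sketched without a GPU. In the toy version below, a `Vec<u8>` stands in for a device memory block and sizes are rounded up to a power of two so that nearby requests share a bucket. The type, its fields, and the bucketing policy are assumptions for illustration; numr's actual allocator may be organized differently.

```rust
use std::collections::HashMap;

/// A toy caching allocator: freed buffers are parked in per-size free lists
/// and handed back on later requests instead of being released.
/// `Vec<u8>` stands in for a device allocation. (Hypothetical type.)
struct CachingAllocator {
    free_lists: HashMap<usize, Vec<Vec<u8>>>,
    fresh_allocations: usize,
    cache_hits: usize,
}

impl CachingAllocator {
    fn new() -> Self {
        CachingAllocator {
            free_lists: HashMap::new(),
            fresh_allocations: 0,
            cache_hits: 0,
        }
    }

    /// Round the request up to a power of two so nearby sizes share a bucket.
    fn bucket(size: usize) -> usize {
        size.max(1).next_power_of_two()
    }

    /// Return a cached block if one is available, otherwise allocate.
    /// A reused block keeps whatever bytes its previous owner left in it.
    fn alloc(&mut self, size: usize) -> Vec<u8> {
        let bucket = Self::bucket(size);
        if let Some(buf) = self.free_lists.get_mut(&bucket).and_then(|list| list.pop()) {
            self.cache_hits += 1;
            return buf;
        }
        self.fresh_allocations += 1;
        vec![0u8; bucket]
    }

    /// Park the block for reuse instead of returning it to the system.
    fn free(&mut self, buf: Vec<u8>) {
        self.free_lists.entry(buf.len()).or_default().push(buf);
    }
}
```

In a training or inference loop that requests the same tensor shapes on every step, almost every request after the first iteration becomes a cache hit, which is where the saving in allocation overhead comes from.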

Getting Started

# Cargo.toml
[dependencies]
numr = "0.5"

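// Illustrative sketch: the names used here (`Device`, `matmul`, `grad`) follow
// the shape of the original example and are not checked against the published
// numr API; consult the crate documentation for the exact calls.
use numr::{autograd::grad, Device, Tensor};

fn main() {
    // Example: fused GEMM + bias + activation on the GPU
    let a = Tensor::randn([128, 256], Device::Cuda);
    let b = Tensor::randn([256, 512], Device::Cuda);
    let bias = Tensor::zeros([512], Device::Cuda);

    // Forward pass (fused). Rust has no `@` operator, so the matrix product
    // is written as a method call (assumed name: `matmul`).
    let y = (a.matmul(&b) + &bias).relu();

    // Backward pass: gradients of `y` with respect to each input.
    // Inputs are passed by reference so they are not moved.
    let grads = grad(&y, &[&a, &b, &bias]);
    println!("Gradients computed!");
}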

For a full tutorial, see the [GitHub repository].

Vision

Unified scientific‑computing stack for Rust, comparable to the Python ecosystem but without the fragmentation.
Zero‑copy, zero‑overhead GPU pipelines that work on any modern GPU.
Extensible foundation for higher‑level ML libraries, simulation tools, and more.

If you’re tired of stitching together incompatible crates, give numr a try. Write once, run everywhere.

Overview

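numr 0.5.0 provides the numerical core: tensors, linear algebra, FFT, statistics, autograd, and the GPU backends. Optimization routines, ODE solvers, and interpolation utilities are not part of numr itself; they live in the companion crate solvr.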

It serves as the foundation for other Rust crates:

solvr – builds and runs on numr 0.5.0, offering scientific‑computing features (optimization, ODE solvers, interpolation).
boostr – an ML framework with attention, Mixture‑of‑Experts (MoE), and Mamba blocks, also built on numr 0.5.0.

Both downstream libraries support end‑to‑end LLM inference and embedding generation.

Key Benefits

Fused kernels – eliminate unnecessary performance overhead.
Full autograd coverage – enables differentiation through realistic computation graphs.
CUDA infrastructure – ensures GPU workloads run efficiently.
Cross‑platform consistency – the same code works on CPU, CUDA, and WebGPU back‑ends.

Release Highlights

Version	Highlights
0.5.0	Unblocks new releases of solvr (scientific computing) and boostr (ML framework), both built on numr.
0.6.0 (planned)	Will focus on hardening: cleaning up error handling, auditing API stability, and preparing for an eventual 1.0 release.
0.7.0+ (roadmap)	Adds native AMD GPU support via ROCm.

Dependencies

[dependencies]
numr = "0.5.0"

With GPU support

# CUDA support
numr = { version = "0.5.0", features = ["cuda"] }

# WebGPU (wgpu) support
numr = { version = "0.5.0", features = ["wgpu"] }

Project Links

GitHub:
crates.io:

License

numr is released under the Apache‑2.0 license. Contributions are welcome.
