numr 0.5.0: The Rust numerical computing library that doesn't make you choose

Published: March 14, 2026 at 04:15 PM EDT
6 min read
Source: Dev.to

Foundational numerical computing for Rust

numr provides n‑dimensional tensors, linear algebra, FFT, statistics, and automatic differentiation, with a CPU backend and native GPU acceleration through CUDA and WebGPU.
It is “NumPy in Rust” but built with gradients, GPUs, and modern dtypes from day one.

What numr Is

  • A foundation library – mathematical building blocks for higher‑level libraries and applications.

What numr Is Not

  • Not just a tensor library (like NumPy’s ndarray).
  • Not a deep‑learning framework.
  • Not a high‑level ML API.
  • Not a collection of domain‑specific tools.

Core Features

Feature | Description
Tensor library | N‑dimensional tensors (like NumPy’s ndarray).
Linear algebra | Decompositions, solvers, etc.
FFT, statistics, random distributions | Comprehensive scientific‑computing primitives.
Automatic differentiation | Built‑in numr::autograd.
Native GPU support | CUDA + WebGPU backends, with autograd.
Cross‑platform GPU | Works on NVIDIA, AMD, Intel, Apple silicon (via WebGPU).
FP8 & quantized kernels | FP8 matmul, i8×i8→i32 matmul, 2:4 structured sparsity.
Fused kernels | GEMM + bias + activation, activation‑mul, add‑norm, etc.
CUDA‑specific improvements | Caching allocator, graph capture, GEMV fast paths, pipelined D2H copy.

For SciPy‑equivalent functionality (optimization, ODE, interpolation, signal), see the companion crate [solvr].

Why numr? – Comparison with NumPy

Capability | NumPy | numr
N‑dimensional tensors | ✓ | ✓
Linear algebra, FFT, stats | ✓ | ✓
Automatic differentiation | ✗ (needs JAX/PyTorch) | ✓ (built‑in numr::autograd)
GPU acceleration | ✗ (needs CuPy/JAX) | ✓ (native CUDA + WebGPU)
Non‑NVIDIA GPUs | ✗ | ✓ (AMD, Intel, Apple via WebGPU)
FP8 support | ✗ | ✓ (E4M3 & E5M2)
2:4 structured sparsity | ✗ | ✓ (all backends)
Quantized matmul (i8×i8→i32) | ✗ | ✓ (CPU)
Fused kernels (GEMM epilogue, activation‑mul, add‑norm) | ✗ | ✓ (CPU, CUDA, WebGPU)
Comprehensive autograd (second‑order) | ✗ | ✓ (conv, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, dtype cast, narrow, cat, gather, …)

The Problem We Solved

  • Fragmentation: Existing Rust crates each solve a single problem (e.g., ndarray for tensors, nalgebra for linear algebra, rustfft for FFT). None provide GPU support or autograd, and they use incompatible types and idioms.
  • Developer burden: You end up writing adapter layers, filing upstream issues, and juggling multiple backends just to get a simple numerical pipeline running on GPUs.

numr removes that burden:

One library, one tensor type, one API – tensors, linalg, FFT, statistics, autograd, GPU.

Write your code once and run it on:

  • CPU (AVX‑512, etc.)
  • NVIDIA (native CUDA kernels)
  • AMD / Intel / Apple silicon (via WebGPU)

Same code, same results.

Release 0.5.0 Highlights

Performance‑critical fused kernels

Kernel | What it does | Benefit
GEMM epilogue | matmul + bias + activation in a single launch | 2‑3× speed‑up for neural‑network inner loops (forward & backward)
Activation‑mul | Fused multiply for gated architectures (e.g., SwiGLU) | One read instead of three
Add‑norm | Residual connection + normalization fused | One read per transformer layer

All kernels run on CPU, CUDA, and WebGPU, and each has a correct backward pass.
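
To make the idea of a GEMM epilogue concrete, here is a minimal CPU sketch in plain Rust. It is not numr's kernel: the function names and the row‑major `&[f32]` layout are assumptions chosen for illustration. The unfused version walks the output buffer three times (matmul, bias, activation), while the fused version applies bias and ReLU while the accumulator is still in a local variable.

```rust
/// Unfused reference: matmul, then bias, then ReLU, each as its own pass.
/// `a` is m x k, `b` is k x n, both row-major; `bias` has n entries.
fn matmul_bias_relu_unfused(
    a: &[f32], b: &[f32], bias: &[f32], m: usize, k: usize, n: usize,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(bias.len(), n);
    // Pass 1: plain matmul writes the m x n output.
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            for j in 0..n {
                out[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    // Pass 2: re-read the whole output to add the bias.
    for i in 0..m {
        for j in 0..n {
            out[i * n + j] += bias[j];
        }
    }
    // Pass 3: re-read it again to apply the activation.
    for v in out.iter_mut() {
        *v = v.max(0.0);
    }
    out
}

/// Fused: bias and ReLU run in the "epilogue" of each dot product,
/// so every output element is written exactly once.
fn matmul_bias_relu_fused(
    a: &[f32], b: &[f32], bias: &[f32], m: usize, k: usize, n: usize,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(bias.len(), n);
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = (acc + bias[j]).max(0.0);
        }
    }
    out
}
```

Both functions return the same values; the difference is only in how often the output is read and written, which is where a fused kernel saves memory traffic.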

FP8 & Quantization

  • FP8 matmul (E4M3 & E5M2) across all backends – crucial for fitting large models in VRAM.
  • i8×i8→i32 quantized matmul on CPU – enables efficient inference without a GPU.
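
As a rough illustration of the two number formats above, the sketch below decodes a single FP8 E4M3 byte and performs an i8×i8→i32 matmul in plain Rust. It is not numr's code: the function names are hypothetical, and the E4M3 decoder assumes the finite‑only variant (no infinities, with S.1111.111 reserved for NaN) that is common in ML tooling.

```rust
/// Decode one FP8 E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
/// Assumes the finite-only variant: no infinities, S.1111.111 is NaN.
fn e4m3_to_f32(byte: u8) -> f32 {
    let sign = if (byte & 0x80) != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((byte >> 3) & 0x0F) as i32;
    let mant_bits = byte & 0x07;
    if exp == 0x0F && mant_bits == 0x07 {
        return f32::NAN;
    }
    let mant = mant_bits as f32 / 8.0;
    if exp == 0 {
        // Subnormal: no implicit leading 1, fixed scale of 2^-6.
        sign * mant * 2.0f32.powi(-6)
    } else {
        // Normal: implicit leading 1, exponent bias of 7.
        sign * (1.0 + mant) * 2.0f32.powi(exp - 7)
    }
}

/// Quantized matmul: i8 inputs, i32 accumulator and output.
/// `a` is m x k, `b` is k x n, both row-major.
fn matmul_i8_i32(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut out = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0i32;
            for p in 0..k {
                // Widen before multiplying so the product cannot overflow i8.
                acc += a[i * k + p] as i32 * b[p * n + j] as i32;
            }
            out[i * n + j] = acc;
        }
    }
    out
}
```

The widening to i32 is the point of the i8×i8→i32 signature: two products of -128 × -128 already sum to 32768, which fits in neither i8 nor i16.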

Structured Sparsity

  • 2:4 sparsity support on every backend.
  • On CUDA it hits the hardware fast path; on CPU/WebGPU it uses optimized sparse kernels.
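
For readers unfamiliar with the 2:4 pattern: in every group of four consecutive weights, at most two may be non‑zero. The sketch below shows one common way to produce that pattern, magnitude pruning, in plain Rust. The function names are hypothetical and this is not how numr stores or multiplies sparse tensors; it only illustrates the constraint itself.

```rust
/// Prune a weight slice to the 2:4 pattern in place: within each group of
/// four consecutive values, keep the two largest magnitudes and zero the rest.
fn prune_2_4(weights: &mut [f32]) {
    assert!(weights.len() % 4 == 0, "length must be a multiple of 4");
    for group in weights.chunks_exact_mut(4) {
        // Positions within the group, sorted by descending magnitude.
        let mut idx = [0usize, 1, 2, 3];
        idx.sort_by(|&x, &y| group[y].abs().total_cmp(&group[x].abs()));
        // The two smallest magnitudes are dropped.
        group[idx[2]] = 0.0;
        group[idx[3]] = 0.0;
    }
}

/// Check that a slice satisfies the 2:4 constraint.
fn is_2_4_sparse(weights: &[f32]) -> bool {
    weights.len() % 4 == 0
        && weights
            .chunks_exact(4)
            .all(|g| g.iter().filter(|v| **v != 0.0).count() <= 2)
}
```

Because exactly half of each group is known to be zero, a kernel only needs the two kept values plus their positions, which is what makes the format friendly to hardware fast paths.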

Autograd Expansion

All of the following are now differentiable (including second‑order derivatives):

conv1d, conv2d, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, fused GEMM epilogue, fused add‑norm, dtype cast, narrow, cat, gather, …
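
To show what "second‑order" means for two of the activations listed, the sketch below writes out the first and second derivatives of softplus and SiLU by hand and provides a finite‑difference helper to check them. This is plain Rust with hypothetical function names; it does not use numr::autograd, it only spells out the values such an engine is expected to reproduce.

```rust
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// softplus(x) = ln(1 + e^x), written in a numerically stable form.
fn softplus(x: f64) -> f64 {
    x.max(0.0) + (-x.abs()).exp().ln_1p()
}

/// d/dx softplus = sigmoid(x)
fn softplus_d1(x: f64) -> f64 {
    sigmoid(x)
}

/// d2/dx2 softplus = sigmoid(x) * (1 - sigmoid(x))
fn softplus_d2(x: f64) -> f64 {
    let s = sigmoid(x);
    s * (1.0 - s)
}

/// SiLU(x) = x * sigmoid(x)
fn silu(x: f64) -> f64 {
    x * sigmoid(x)
}

/// d/dx SiLU = s + x * s * (1 - s)
fn silu_d1(x: f64) -> f64 {
    let s = sigmoid(x);
    s + x * s * (1.0 - s)
}

/// d2/dx2 SiLU = s * (1 - s) * (2 + x * (1 - 2s))
fn silu_d2(x: f64) -> f64 {
    let s = sigmoid(x);
    s * (1.0 - s) * (2.0 + x * (1.0 - 2.0 * s))
}

/// Central finite difference, used only to sanity-check the formulas.
fn central_diff(f: impl Fn(f64) -> f64, x: f64, h: f64) -> f64 {
    (f(x + h) - f(x - h)) / (2.0 * h)
}
```

Applying `central_diff` to `softplus_d1` or `silu_d1` approximates the second derivative, which gives a quick numerical check for any double‑backward implementation.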

Additional features:

  • Activation checkpointing – trade compute for memory.
  • Backward hooks – trigger distributed gradient sync during backprop.
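
The compute‑for‑memory trade behind activation checkpointing can be sketched without any tensor library. The toy below chains scalar layers y = tanh(w · x) and computes the gradient of the output with respect to the input twice: once storing every layer input, and once storing only one input per segment and recomputing the rest during the backward pass. All names are hypothetical and this is not numr's API.

```rust
/// One toy layer: y = tanh(w * x).
fn layer_forward(w: f64, x: f64) -> f64 {
    (w * x).tanh()
}

/// Chain rule through one layer, given that layer's input.
fn layer_backward(w: f64, x: f64, grad_out: f64) -> f64 {
    let t = (w * x).tanh();
    grad_out * w * (1.0 - t * t)
}

/// Baseline: keep every layer input alive until the backward pass.
fn grad_full(weights: &[f64], x: f64) -> f64 {
    let mut inputs = Vec::with_capacity(weights.len());
    let mut h = x;
    for &w in weights {
        inputs.push(h);
        h = layer_forward(w, h);
    }
    let mut g = 1.0;
    for (&w, &inp) in weights.iter().zip(&inputs).rev() {
        g = layer_backward(w, inp, g);
    }
    g
}

/// Checkpointed: store only the input of each `segment`-sized block and
/// recompute the inner activations when the backward pass reaches that block.
/// Returns the gradient and the number of activations stored in the forward pass.
fn grad_checkpointed(weights: &[f64], x: f64, segment: usize) -> (f64, usize) {
    assert!(segment > 0);
    let mut checkpoints = Vec::new();
    let mut h = x;
    for (i, &w) in weights.iter().enumerate() {
        if i % segment == 0 {
            checkpoints.push(h);
        }
        h = layer_forward(w, h);
    }
    let stored = checkpoints.len();
    let mut g = 1.0;
    for (seg_idx, seg) in weights.chunks(segment).enumerate().rev() {
        // Recompute this segment's inputs from its checkpoint.
        let mut inputs = Vec::with_capacity(seg.len());
        let mut h = checkpoints[seg_idx];
        for &w in seg {
            inputs.push(h);
            h = layer_forward(w, h);
        }
        for (&w, &inp) in seg.iter().zip(&inputs).rev() {
            g = layer_backward(w, inp, g);
        }
    }
    (g, stored)
}
```

With five layers and a segment size of two, the checkpointed version keeps three activations across the forward pass instead of five, at the cost of running each layer's forward a second time.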

This is not an ML framework; it is the autograd engine that frameworks can be built on.

CUDA‑specific Improvements

  • Caching allocator – reuses memory blocks on the Rust side, dramatically cutting allocation overhead.
  • Graph capture – record a sequence of kernel launches once and replay it as a single unit, removing per‑kernel launch overhead (essential for high‑throughput inference).
  • GEMV fast paths – specialized kernels for the common case where one matrix dimension is tiny (e.g., batch‑size 1 inference).
  • Pipelined D2H copy – overlap GPU computation with host‑side data transfer.
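
The caching‑allocator idea can be sketched in a few lines of plain Rust. This is a simplified model, not numr's allocator: `Block` uses a `Vec<u8>` as a stand‑in for device memory, the power‑of‑two bucketing is an assumption, and all names are hypothetical. The point is that released blocks go back into a size‑keyed pool instead of being freed, so a later request of similar size skips the expensive allocation.

```rust
use std::collections::HashMap;

/// Stand-in for a device buffer; a real allocator would hold a device pointer.
struct Block {
    bytes: Vec<u8>,
}

/// Keeps released blocks in per-size free lists and hands them out again.
struct CachingAllocator {
    free: HashMap<usize, Vec<Block>>,
    /// Counts how often the (expensive) underlying allocation actually ran.
    fresh_allocations: usize,
}

impl CachingAllocator {
    fn new() -> Self {
        CachingAllocator { free: HashMap::new(), fresh_allocations: 0 }
    }

    /// Round sizes up to a power of two so similar requests share a bucket.
    fn bucket(size: usize) -> usize {
        size.max(1).next_power_of_two()
    }

    fn alloc(&mut self, size: usize) -> Block {
        let bucket = Self::bucket(size);
        // Fast path: reuse a cached block of the right bucket size.
        if let Some(block) = self.free.get_mut(&bucket).and_then(|list| list.pop()) {
            return block;
        }
        // Slow path: fall back to a fresh allocation.
        self.fresh_allocations += 1;
        Block { bytes: vec![0u8; bucket] }
    }

    /// Return a block to the pool instead of freeing it.
    fn release(&mut self, block: Block) {
        self.free.entry(block.bytes.len()).or_default().push(block);
    }
}
```

In a loop that allocates and releases same‑shaped temporaries on every step, only the first iteration pays for fresh allocations; later iterations are served from the pool.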

These upgrades move numr from an “interesting foundation” to a production‑ready library.

Getting Started

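# Cargo.toml
[dependencies]
numr = "0.5"

// src/main.rs
// Illustrative only: the `Device` import path, the `matmul` method, and the
// reference-taking `grad` call are assumptions; check the numr docs for exact signatures.
use numr::{autograd::grad, Device, Tensor};

fn main() {
    // Example: fused GEMM + bias + activation on the GPU
    let a = Tensor::randn([128, 256], Device::Cuda);
    let b = Tensor::randn([256, 512], Device::Cuda);
    let bias = Tensor::zeros([512], Device::Cuda);

    // Forward pass (fused). Rust has no `@` operator, so a `matmul` method is assumed here.
    let y = (a.matmul(&b) + &bias).relu();

    // Backward pass: gradients of y with respect to a, b, and bias
    let grads = grad(&y, &[&a, &b, &bias]);
    println!("Gradients computed!");
}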

For a full tutorial, see the [GitHub repository].

Vision

  • Unified scientific‑computing stack for Rust, comparable to the Python ecosystem but without the fragmentation.
  • Zero‑copy, zero‑overhead GPU pipelines that work on any modern GPU.
  • Extensible foundation for higher‑level ML libraries, simulation tools, and more.

If you’re tired of stitching together incompatible crates, give numr a try. Write once, run everywhere.

Overview

numr 0.5.0 does not itself provide optimization routines, ODE solvers, or interpolation utilities; those belong to solvr. Instead, numr serves as the foundation for other Rust crates:

  • solvr – builds and runs on numr 0.5.0, offering scientific‑computing features (optimization, ODE solvers, interpolation).
  • boostr – an ML framework with attention, Mixture‑of‑Experts (MoE), and Mamba blocks, also built on numr 0.5.0.

Both downstream libraries support end‑to‑end LLM inference and embedding generation.

Key Benefits

  • Fused kernels – cut redundant memory reads and kernel launches.
  • Full autograd coverage – enables differentiation through realistic computation graphs.
  • CUDA infrastructure – ensures GPU workloads run efficiently.
  • Cross‑platform consistency – the same code works on CPU, CUDA, and WebGPU back‑ends.

Release Highlights

Version | Highlights
0.5.0 | Unblocks new releases of solvr (scientific computing) and boostr (ML framework), both built on numr.
0.6.0 | Focuses on hardening: cleaning up error handling, auditing API stability, and preparing for an eventual 1.0 release.
0.7.0+ (roadmap) | Adds native AMD GPU support via ROCm.

Dependencies

[dependencies]
numr = "0.5.0"

With GPU support

# CUDA support
numr = { version = "0.5.0", features = ["cuda"] }

# WebGPU (wgpu) support
numr = { version = "0.5.0", features = ["wgpu"] }
  • GitHub:
  • crates.io:

License

numr is released under the Apache‑2.0 license. Contributions are welcome.
