When I Took Numba to the Dojo: A Battle Royale Against Rust and CUDA

Published: December 25, 2025 at 07:07 PM EST
4 min read
Source: Dev.to

Before we dive in, I want to acknowledge Shreyan Ghosh (@zenoguy) and his wonderful article “When Time Became a Variable — Notes From My Journey With Numba” (dev.to link).

His piece captured something beautiful about computing: the joy of experimentation, the thrill of watching code go fast, and the curiosity to ask “what if?”.

“Somewhere between algorithms and hardware, Numba didn’t just make my code faster. It made exploration lighter.”

Reading his benchmarks, I couldn’t help but wonder: What happens when we throw Rust into the mix? What about raw CUDA? Where does the hardware actually give up?

So I built a dojo. Let’s spar.

🎯 The Challenge

Same challenge as Shreyan’s original experiment:

f(x) = sqrt(x² + 1) × sin(x) + cos(x/2)

Compute this for 20 million elements.
Simple math. Maximum optimization. Who wins?
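
For reference, the NumPy‑vectorized baseline (the 1.0× row in the results below) looks roughly like this; the repo's version may differ in details:

import numpy as np

def compute_numpy(arr):
    """Vectorized baseline: sqrt(x² + 1)·sin(x) + cos(x/2)."""
    return np.sqrt(arr * arr + 1.0) * np.sin(arr) + np.cos(0.5 * arr)

x = np.random.default_rng(42).random(20_000_000)   # 20 million float64 values
y = compute_numpy(x)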

🥊 The Contenders

Team Python 🐍

| Variant | Description |
| --- | --- |
| Pure Python | Baseline implementation (interpreter overhead, GIL‑bound). |
| NumPy Vectorized | Standard NumPy‑based approach. |
| Numba JIT | Single‑threaded compiled code via Numba. |
| Numba Parallel | Multi‑threaded version using prange. |
| Numba @vectorize | Parallel ufunc implementation. |

Team Rust 🦀

| Variant | Description |
| --- | --- |
| Single‑threaded | Idiomatic iterator‑based code. |
| Parallel (Rayon) | Work‑stealing parallelism with the Rayon crate. |
| Parallel Chunks | Cache‑optimized chunked processing. |

Team GPU 🎮

| Variant | Description |
| --- | --- |
| Numba CUDA | Python kernels executed on the GPU. |
| CUDA C++ FP64 | Double‑precision native implementation. |
| CUDA C++ FP32 | Single‑precision native implementation. |
| CUDA C++ Intrinsics | Hardware‑optimized math intrinsics. |

🏗️ The Setup

I wanted this to be reproducible and fair:

  • Same computation across all implementations.
  • Same array size (20 million float64 elements).
  • Same random seed (42).
  • Multiple warm‑up runs to eliminate JIT/cache effects.
  • Take the minimum of multiple runs (least noise).

The full benchmark suite is open source: github.com/copyleftdev/numba-dojo

# Run everything yourself
git clone https://github.com/copyleftdev/numba-dojo.git
cd numba-dojo
make all
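
A minimal timing harness along those lines might look like this (a sketch of the methodology above, not the repo's actual harness):

import time
import numpy as np

def benchmark_ms(fn, arr, warmups=3, runs=10):
    """Warm up first (JIT compilation, cache effects), then return the best of several runs in ms."""
    out = np.empty_like(arr)
    for _ in range(warmups):
        fn(arr, out)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(arr, out)
        times.append(time.perf_counter() - t0)
    return min(times) * 1e3

x = np.random.default_rng(42).random(20_000_000)
# e.g. print(f"Numba parallel: {benchmark_ms(compute_numba_parallel, x):.2f} ms")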

📊 The Results

The Full Leaderboard

| Rank | Implementation | Time (ms) | Speedup vs NumPy |
| --- | --- | --- | --- |
| 🥇 | CUDA C++ FP32 | 0.21 | 3,255× |
| 🥈 | Numba CUDA FP32 | 2.52 | 265× |
| 🥉 | CUDA C++ FP64 | 4.11 | 162× |
| 4 | Numba CUDA FP64 | 4.14 | 161× |
| 5 | Rust Parallel | 12.39 | 54× |
| 6 | Numba @vectorize | 14.86 | 45× |
| 7 | Numba Parallel | 15.55 | 43× |
| 8 | Rust Single | 555.62 | 1.2× |
| 9 | Numba JIT | 558.30 | 1.2× |
| 10 | NumPy Vectorized | 667.30 | 1.0× |
| 11 | Pure Python | ~6,650 | 0.1× |

(Chart: full benchmark results.)

Speedup Visualization

(Chart: speedup of each implementation relative to the NumPy baseline.)

Category Champions

(Chart: the fastest implementation in each category.)

🔬 What I Learned

1. GPU ≫ CPU (when it fits)

  • RTX 3080 Ti: 0.21 ms, a 3,255× speedup over NumPy.
  • For embarrassingly parallel, element‑wise workloads, GPUs are in a different league.
  • The massive parallelism (80 SMs, thousands of cores) completely crushes sequential execution.

2. FP32 ≈ 20× faster than FP64 on consumer GPUs

CUDA FP64: 4.11 ms
CUDA FP32: 0.21 ms   ← 20× faster!
  • Consumer GeForce GPUs have very few FP64 units (≈ 1/32 the throughput of FP32).
  • If your algorithm tolerates single‑precision, use FP32.
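
In practice that just means handing the kernels float32 data and checking that the loss of precision is acceptable for your use case; for example:

import numpy as np

x64 = np.random.default_rng(42).random(20_000_000)    # float64 input
x32 = x64.astype(np.float32)                           # half the memory traffic

y64 = np.sqrt(x64 * x64 + 1.0) * np.sin(x64) + np.cos(0.5 * x64)
y32 = np.sqrt(x32 * x32 + 1.0) * np.sin(x32) + np.cos(0.5 * x32)

# Worst‑case absolute error introduced by dropping to single precision
print(np.max(np.abs(y64 - y32.astype(np.float64))))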

3. Rust ≈ Numba JIT (single‑threaded)

Rust (single‑threaded): 555.62 ms
Numba JIT:             558.30 ms
  • Both compile to LLVM IR and generate almost identical code.
  • The tiny difference is just noise, confirming Numba’s claim: “Feels like Python, behaves like C.”

4. Rust beats Numba in parallel (~20 % faster)

Rust Parallel (Rayon): 12.39 ms
Numba Parallel:       15.55 ms
  • Rayon’s work‑stealing scheduler has lower overhead than Numba’s threading.
  • For CPU‑parallel workloads in production, Rust has a clear edge.

5. We hit the memory‑bandwidth wall

Profiling the FP32 CUDA kernel gave:

Time:       0.21 ms
Bandwidth:  ~777 GB/s achieved
Theoretical: 912 GB/s (RTX 3080 Ti)
Efficiency: 85 %
  • The GPU is running at 85 % of peak memory bandwidth.
  • The cores are largely idle; the bottleneck is moving data in and out of memory.
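
The back‑of‑envelope math behind that bullet (assuming one float32 read and one float32 write per element; the flop count is my rough estimate, not a profiler figure):

n = 20_000_000
kernel_time_s = 0.21e-3                 # measured FP32 kernel time

bytes_moved = n * (4 + 4)               # read one float32, write one float32
bandwidth_gbs = bytes_moved / kernel_time_s / 1e9
print(f"{bandwidth_gbs:.0f} GB/s achieved")   # ≈ 762 GB/s, in line with the ~777 GB/s profiled

flops = n * 10                          # sqrt, sin, cos, a few mults and adds (rough count)
intensity = flops / bytes_moved         # ≈ 1.25 FLOP/byte: firmly memory‑bound
print(f"{intensity:.2f} FLOP/byte of arithmetic intensity")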

Waiting for data: no algorithm can beat physics.

This is the Roofline Model in action:

                    Peak Compute
                         /
                        /
Performance            /
                      /  ←  We're here (memory‑bound)
                     /
                    /
            ──────────────────────
                Memory Bandwidth

For this workload the arithmetic intensity is low (few operations per byte), so we’ve hit the memory‑bandwidth ceiling.

Bottom line

  • GPUs dominate pure data‑parallel work.
  • FP32 is the sweet spot on consumer hardware.
  • Rust holds its own (and even outpaces Numba) on the CPU.
  • Memory bandwidth is the ultimate ceiling for both.

Feel free to clone the repo and run the benchmarks yourself!

🧪 The Code

Below are three self‑contained implementations of the same kernel.

1️⃣ Numba (the hero of the original article)

from numba import njit, prange
import numpy as np

@njit(parallel=True, fastmath=True, cache=True)
def compute_numba_parallel(arr, out):
    """Compute sqrt(x²+1)·sin(x) + cos(0.5·x) element‑wise."""
    n = len(arr)
    for i in prange(n):
        x = arr[i]
        out[i] = np.sqrt(x * x + 1.0) * np.sin(x) + np.cos(0.5 * x)

Just add @njit; the rest is pure NumPy‑style Python.
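
The @vectorize entry from the leaderboard is almost as short. A sketch of that variant (the options are my assumption; the repo's version may differ):

import math
from numba import vectorize

@vectorize(["float64(float64)"], target="parallel")
def compute_vectorized(x):
    # Numba compiles this scalar function into a multi‑threaded NumPy ufunc.
    return math.sqrt(x * x + 1.0) * math.sin(x) + math.cos(0.5 * x)

# Usage: out = compute_vectorized(arr)   # arr is a float64 NumPy array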

2️⃣ Rust (the challenger)

use rayon::prelude::*;

/// Compute sqrt(x²+1)·sin(x) + cos(0.5·x) element‑wise.
pub fn compute_parallel(arr: &[f64], out: &mut [f64]) {
    out.par_iter_mut()
        .zip(arr.par_iter())
        .for_each(|(o, &x)| {
            *o = (x * x + 1.0).sqrt() * x.sin() + (0.5 * x).cos();
        });
}

rayon makes data‑parallelism feel as natural as ordinary iterators.

3️⃣ CUDA C++ (the champion)

#include <cuda_runtime.h>
#include <cmath>

__global__ void compute_fp32(const float *arr, float *out, size_t n) {
    // One thread per element
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float x = arr[idx];
        out[idx] = sqrtf(x * x + 1.0f) * sinf(x) + cosf(0.5f * x);
    }
}

/* Helper to launch the kernel */
void launch_compute(const float *d_arr, float *d_out, size_t n) {
    const int threads_per_block = 256;
    const int blocks = (n + threads_per_block - 1) / threads_per_block;
    compute_fp32<<<blocks, threads_per_block>>>(d_arr, d_out, n);
    cudaDeviceSynchronize();   // error checking omitted for brevity
}

A straightforward kernel that maps one thread to each array element.
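
For completeness, the Numba CUDA variant from the leaderboard stays in Python. A minimal sketch of it (mine, assuming the usual one‑thread‑per‑element launch; the repo's kernel may differ):

from numba import cuda
import math
import numpy as np

@cuda.jit
def compute_cuda(arr, out):
    # One GPU thread per element, mirroring the CUDA C++ kernel above.
    i = cuda.grid(1)
    if i < arr.size:
        x = arr[i]
        out[i] = math.sqrt(x * x + 1.0) * math.sin(x) + math.cos(0.5 * x)

# Hypothetical launch, using FP32 to match the fastest Python entry.
n = 20_000_000
x = np.random.default_rng(42).random(n, dtype=np.float32)
d_x = cuda.to_device(x)
d_out = cuda.device_array_like(d_x)
threads = 256
blocks = (n + threads - 1) // threads
compute_cuda[blocks, threads](d_x, d_out)
result = d_out.copy_to_host()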

References

  • Shreyan Ghosh, “When Time Became a Variable — Notes From My Journey With Numba”, Dev.to.
  • Benchmark suite: github.com/copyleftdev/numba-dojo

Keep experimenting. Keep playing. That’s what computing is for.

What’s your favorite performance‑optimization story? Drop it in the comments!
