When I Took Numba to the Dojo: A Battle Royale Against Rust and CUDA
Before we dive in, I want to acknowledge Shreyan Ghosh (@zenoguy) and his wonderful article “When Time Became a Variable — Notes From My Journey With Numba” (dev.to link).
His piece captured something beautiful about computing: the joy of experimentation, the thrill of watching code go fast, and the curiosity to ask “what if?”.
“Somewhere between algorithms and hardware, Numba didn’t just make my code faster. It made exploration lighter.”
Reading his benchmarks, I couldn’t help but wonder: What happens when we throw Rust into the mix? What about raw CUDA? Where does the hardware actually give up?
So I built a dojo. Let’s spar.
🎯 The Challenge
Same challenge as Shreyan’s original experiment:
f(x) = sqrt(x² + 1) × sin(x) + cos(x/2)
Compute this for 20 million elements.
Simple math. Maximum optimization. Who wins?
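In NumPy terms – the 1.0× baseline every speedup below is measured against – the whole challenge fits in a couple of lines. This is my paraphrase; the repo's exact code may differ:

```python
import numpy as np

def compute_numpy(x: np.ndarray) -> np.ndarray:
    """Vectorized baseline: sqrt(x² + 1)·sin(x) + cos(x/2)."""
    return np.sqrt(x * x + 1.0) * np.sin(x) + np.cos(0.5 * x)

x = np.random.default_rng(42).random(20_000_000)   # 20 million float64 values, seed 42
y = compute_numpy(x)
```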
🥊 The Contenders
Team Python 🐍
| Variant | Description |
|---|---|
| Pure Python | Baseline implementation (interpreter overhead, GIL‑bound). |
| NumPy Vectorized | Standard NumPy‑based approach. |
| Numba JIT | Single‑threaded compiled code via Numba. |
| Numba Parallel | Multi‑threaded version using prange. |
| Numba @vectorize | Parallel ufunc implementation. |
Team Rust 🦀
| Variant | Description |
|---|---|
| Single‑threaded | Idiomatic iterator‑based code. |
| Parallel (Rayon) | Work‑stealing parallelism with the Rayon crate. |
| Parallel Chunks | Cache‑optimized chunked processing. |
Team GPU 🎮
| Variant | Description |
|---|---|
| Numba CUDA | Python kernels executed on the GPU. |
| CUDA C++ FP64 | Double‑precision native implementation. |
| CUDA C++ FP32 | Single‑precision native implementation. |
| CUDA C++ Intrinsics | Hardware‑optimized math intrinsics. |
🏗️ The Setup
I wanted this to be reproducible and fair:
- Same computation across all implementations.
- Same array size (20 million `float64` elements).
- Same random seed (`42`).
- Multiple warm‑up runs to eliminate JIT/cache effects.
- Take the minimum of multiple runs (least noise) – a sketch of the harness follows below.
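The repo ships its own harness, but the methodology above boils down to something like this (reusing the `compute_numpy` sketch from earlier; the warm‑up and run counts are illustrative):

```python
import time
import numpy as np

def benchmark(fn, x, warmups=3, runs=10):
    """Warm up (trigger JIT compilation, populate caches), then keep the minimum wall time."""
    for _ in range(warmups):
        fn(x)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(x)
        times.append(time.perf_counter() - t0)
    return min(times)

x = np.random.default_rng(42).random(20_000_000)    # same size and seed everywhere
print(f"NumPy: {benchmark(compute_numpy, x) * 1e3:.2f} ms")
```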
The full benchmark suite is open source: github.com/copyleftdev/numba-dojo
```bash
# Run everything yourself
git clone https://github.com/copyleftdev/numba-dojo.git
cd numba-dojo
make all
```
📊 The Results
The Full Leaderboard
| Rank | Implementation | Time (ms) | Speedup vs NumPy |
|---|---|---|---|
| 🥇 | CUDA C++ FP32 | 0.21 | 3,255× |
| 🥈 | Numba CUDA FP32 | 2.52 | 265× |
| 🥉 | CUDA C++ FP64 | 4.11 | 162× |
| 4 | Numba CUDA FP64 | 4.14 | 161× |
| 5 | Rust Parallel | 12.39 | 54× |
| 6 | Numba @vectorize | 14.86 | 45× |
| 7 | Numba Parallel | 15.55 | 43× |
| 8 | Rust Single | 555.62 | 1.2× |
| 9 | Numba JIT | 558.30 | 1.2× |
| 10 | NumPy Vectorized | 667.30 | 1.0× |
| 11 | Pure Python | ~6,650 | 0.1× |

Speedup Visualization

*(Bar chart of the speedups in the table above.)*

Category Champions

*(The fastest entry from each team – GPU: CUDA C++ FP32, Rust: Rayon parallel, Python: Numba @vectorize.)*
🔬 What I Learned
1. GPU ≫ CPU (when it fits)
- RTX 3080 Ti: `0.21 ms` → 3,255× faster than NumPy.
- For embarrassingly parallel, element‑wise workloads, GPUs are in a different league.
- The massive parallelism (80 SMs, thousands of CUDA cores) completely crushes sequential execution – a minimal Numba CUDA sketch follows.
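The “Numba CUDA” rows come from kernels written entirely in Python. The repo's version may differ in detail, but the shape is roughly this (FP32 variant):

```python
from numba import cuda
import math
import numpy as np

@cuda.jit
def compute_kernel(arr, out):
    i = cuda.grid(1)                 # one thread per element
    if i < arr.size:
        x = arr[i]
        out[i] = math.sqrt(x * x + 1.0) * math.sin(x) + math.cos(0.5 * x)

x = np.random.default_rng(42).random(20_000_000).astype(np.float32)
d_x, d_out = cuda.to_device(x), cuda.device_array_like(x)

threads = 256
blocks = (x.size + threads - 1) // threads
compute_kernel[blocks, threads](d_x, d_out)
cuda.synchronize()                   # kernel launches are asynchronous
```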
2. FP32 ≈ 20× faster than FP64 on consumer GPUs
```text
CUDA FP64: 4.11 ms
CUDA FP32: 0.21 ms   ← ~20× faster
```
- Consumer GeForce GPUs have very few FP64 units (typically 1/32 to 1/64 of the FP32 throughput, depending on the generation).
- If your algorithm tolerates single precision, use FP32 – and check how much accuracy you actually lose, as sketched below.
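Before committing to FP32, a quick sanity check on the precision loss for your actual inputs goes a long way; one rough way to do it:

```python
import numpy as np

x64 = np.random.default_rng(42).random(20_000_000)   # float64 inputs
x32 = x64.astype(np.float32)                          # same values, single precision

y64 = np.sqrt(x64 * x64 + 1.0) * np.sin(x64) + np.cos(0.5 * x64)
y32 = np.sqrt(x32 * x32 + 1.0) * np.sin(x32) + np.cos(0.5 * x32)

# Worst-case absolute error of the FP32 path for this workload
print(np.max(np.abs(y64 - y32.astype(np.float64))))
```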
3. Rust ≈ Numba JIT (single‑threaded)
```text
Rust (single‑threaded): 555.62 ms
Numba JIT:              558.30 ms
```
- Both are lowered through LLVM and end up generating very similar machine code for this loop.
- The small gap is within run‑to‑run noise – living up to the pitch that Numba code “feels like Python, behaves like C” (the single‑threaded variant is sketched below).
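For reference, the single‑threaded Numba entry is presumably just the parallel loop from The Code section with `parallel=True` and `prange` removed – roughly:

```python
from numba import njit
import numpy as np

@njit(fastmath=True, cache=True)
def compute_numba_single(arr, out):
    # Same body as the parallel version, just an ordinary sequential loop
    for i in range(arr.shape[0]):
        x = arr[i]
        out[i] = np.sqrt(x * x + 1.0) * np.sin(x) + np.cos(0.5 * x)
```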
4. Rust beats Numba in parallel (~25 % faster)
```text
Rust Parallel (Rayon): 12.39 ms
Numba Parallel:        15.55 ms
```
- Rayon’s work‑stealing scheduler has lower overhead than Numba’s threading.
- For CPU‑parallel workloads in production, Rust has a clear edge (though Numba's threading layer and thread count are worth experimenting with first – a sketch follows).
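If you want to poke at that gap yourself, Numba exposes both the thread count and the threading layer. A small sketch of those knobs – the TBB backend is optional and must be installed, and the thread count of 8 is just a stand‑in for whatever the Rayon build uses:

```python
from numba import config, njit, prange, set_num_threads
import numpy as np

config.THREADING_LAYER = "tbb"   # or "omp" / "workqueue"; set before the first parallel compile
set_num_threads(8)               # match the thread count used by the Rust/Rayon binary

@njit(parallel=True, fastmath=True)
def compute(arr, out):
    for i in prange(arr.shape[0]):
        x = arr[i]
        out[i] = np.sqrt(x * x + 1.0) * np.sin(x) + np.cos(0.5 * x)
```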
5. We hit the memory‑bandwidth wall
Profiling the FP32 CUDA kernel gave:

```text
Time:         0.21 ms
Bandwidth:   ~777 GB/s achieved
Theoretical:  912 GB/s (RTX 3080 Ti)
Efficiency:   85 %
```

- The GPU is running at ~85 % of its peak memory bandwidth.
- The compute cores are largely idle; the bottleneck is moving data in and out of memory (a quick back‑of‑the‑envelope check follows).
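That achieved‑bandwidth figure falls straight out of the array size and the kernel time. A rough check, assuming the kernel streams one float32 array in and one out with no cache reuse:

```python
n = 20_000_000
bytes_moved = n * 4 * 2          # read 20 M float32 values + write 20 M float32 values
kernel_time = 0.21e-3            # seconds

print(f"{bytes_moved / kernel_time / 1e9:.0f} GB/s")
# ≈ 762 GB/s – the same ballpark as the profiler's ~777 GB/s,
# against a theoretical 912 GB/s for the RTX 3080 Ti.
```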
No algorithm can beat physics: at this point the kernel is simply *waiting for data*. This is the Roofline Model in action:

```text
 Performance
     ▲
     │                ___________________  ← peak compute (compute‑bound)
     │               /
     │              /
     │             X  ← we're here (memory‑bound)
     │            /
     │           /   slope = memory bandwidth
     └──────────────────────────────────▶
         Arithmetic intensity (FLOPs per byte)
```

For this workload the arithmetic intensity is low (few operations per byte), so we've hit the memory‑bandwidth ceiling.

Bottom line

- GPUs dominate pure data‑parallel work.
- FP32 is the sweet spot on consumer hardware.
- Rust holds its own (and even outpaces Numba) on the CPU.
- Memory bandwidth is the ultimate ceiling for both.

Feel free to clone the repo and run the benchmarks yourself!
🧪 The Code
Below are three self‑contained implementations of the same kernel.
1️⃣ Numba (the hero of the original article)
```python
from numba import njit, prange
import numpy as np

@njit(parallel=True, fastmath=True, cache=True)
def compute_numba_parallel(arr, out):
    """Compute sqrt(x²+1)·sin(x) + cos(0.5·x) element‑wise."""
    n = len(arr)
    for i in prange(n):
        x = arr[i]
        out[i] = np.sqrt(x * x + 1.0) * np.sin(x) + np.cos(0.5 * x)
```
Just add the `@njit` decorator (plus `prange` for the parallel loop); the rest is plain NumPy‑style Python.
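Calling it just needs a pre‑allocated output buffer; the first call pays the one‑off compilation cost:

```python
import numpy as np

x = np.random.default_rng(42).random(20_000_000)
out = np.empty_like(x)

compute_numba_parallel(x, out)   # first call compiles; repeat calls run at full speed
```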
2️⃣ Rust (the challenger)
```rust
use rayon::prelude::*;

/// Compute sqrt(x²+1)·sin(x) + cos(0.5·x) element‑wise.
pub fn compute_parallel(arr: &[f64], out: &mut [f64]) {
    out.par_iter_mut()
        .zip(arr.par_iter())
        .for_each(|(o, &x)| {
            *o = (x * x + 1.0).sqrt() * x.sin() + (0.5 * x).cos();
        });
}
```
rayon makes data‑parallelism feel as natural as ordinary iterators.
3️⃣ CUDA C++ (the champion)
```cpp
#include <cuda_runtime.h>
#include <cmath>

__global__ void compute_fp32(const float *arr, float *out, size_t n) {
    // One thread per element
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float x = arr[idx];
        out[idx] = sqrtf(x * x + 1.0f) * sinf(x) + cosf(0.5f * x);
    }
}

/* Helper to launch the kernel */
void launch_compute(const float *d_arr, float *d_out, size_t n) {
    const int threads_per_block = 256;
    const int blocks = (n + threads_per_block - 1) / threads_per_block;
    compute_fp32<<<blocks, threads_per_block>>>(d_arr, d_out, n);
    cudaDeviceSynchronize();  // error checking omitted for brevity
}
```
A straightforward one‑thread‑per‑element kernel: each thread computes exactly one output value.
References
- Original inspiration – Shreyan Ghosh, “When Time Became a Variable — Notes From My Journey With Numba” (dev.to), the article that sparked this comparison.
- Numba documentation – https://numba.pydata.org/
- Rayon (Rust) – https://github.com/rayon-rs/rayon
- Roofline Model – https://en.wikipedia.org/wiki/Roofline_model
Keep experimenting. Keep playing. That’s what computing is for. ✨
What’s your favorite performance‑optimization story? Drop it in the comments!