When I Took Numba to the Dojo: A Battle Royale Against Rust and CUDA

Published: December 25, 2025 at 07:07 PM EST
4 min read
Source: Dev.to

Before we dive in, I want to acknowledge Shreyan Ghosh (@zenoguy) and his wonderful article “When Time Became a Variable — Notes From My Journey With Numba” (dev.to link).

His piece captured something beautiful about computing: the joy of experimentation, the thrill of watching code go fast, and the curiosity to ask “what if?”.

“Somewhere between algorithms and hardware, Numba didn’t just make my code faster. It made exploration lighter.”

Reading his benchmarks, I couldn’t help but wonder: What happens when we throw Rust into the mix? What about raw CUDA? Where does the hardware actually give up?

So I built a dojo. Let’s spar.

🎯 The Challenge

Same challenge as Shreyan’s original experiment:

f(x) = sqrt(x² + 1) × sin(x) + cos(x/2)

Compute this for 20 million elements.
Simple math. Maximum optimization. Who wins?
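
For reference, the NumPy‑vectorized baseline (the 1.0× row in the results below) looks roughly like this; the repo's version may differ in details:

import numpy as np

def compute_numpy(arr):
    """Vectorized baseline: sqrt(x² + 1)·sin(x) + cos(x/2)."""
    return np.sqrt(arr * arr + 1.0) * np.sin(arr) + np.cos(0.5 * arr)

x = np.random.default_rng(42).random(20_000_000)   # 20 million float64 values
y = compute_numpy(x)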

🥊 The Contenders

Team Python 🐍

| Variant | Description |
| --- | --- |
| Pure Python | Baseline implementation (interpreter overhead, GIL‑bound). |
| NumPy Vectorized | Standard NumPy‑based approach. |
| Numba JIT | Single‑threaded compiled code via Numba. |
| Numba Parallel | Multi‑threaded version using prange. |
| Numba @vectorize | Parallel ufunc implementation. |

Team Rust 🦀

| Variant | Description |
| --- | --- |
| Single‑threaded | Idiomatic iterator‑based code. |
| Parallel (Rayon) | Work‑stealing parallelism with the Rayon crate. |
| Parallel Chunks | Cache‑optimized chunked processing. |

Team GPU 🎮

| Variant | Description |
| --- | --- |
| Numba CUDA | Python kernels executed on the GPU. |
| CUDA C++ FP64 | Double‑precision native implementation. |
| CUDA C++ FP32 | Single‑precision native implementation. |
| CUDA C++ Intrinsics | Hardware‑optimized math intrinsics. |

🏗️ The Setup

I wanted this to be reproducible and fair:

  • Same computation across all implementations.
  • Same array size (20 million float64 elements).
  • Same random seed (42).
  • Multiple warm‑up runs to eliminate JIT/cache effects.
  • Take the minimum of multiple runs (least noise).

The full benchmark suite is open source: github.com/copyleftdev/numba-dojo

# Run everything yourself
git clone https://github.com/copyleftdev/numba-dojo.git
cd numba-dojo
make all
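
A minimal timing harness along those lines might look like this (a sketch of the methodology above, not the repo's actual harness):

import time
import numpy as np

def benchmark_ms(fn, arr, warmups=3, runs=10):
    """Warm up first (JIT compilation, cache effects), then return the best of several runs in ms."""
    out = np.empty_like(arr)
    for _ in range(warmups):
        fn(arr, out)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(arr, out)
        times.append(time.perf_counter() - t0)
    return min(times) * 1e3

x = np.random.default_rng(42).random(20_000_000)
# e.g. print(f"Numba parallel: {benchmark_ms(compute_numba_parallel, x):.2f} ms")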

📊 The Results

The Full Leaderboard

| Rank | Implementation | Time (ms) | Speedup vs NumPy |
| --- | --- | --- | --- |
| 🥇 | CUDA C++ FP32 | 0.21 | 3,255× |
| 🥈 | Numba CUDA FP32 | 2.52 | 265× |
| 🥉 | CUDA C++ FP64 | 4.11 | 162× |
| 4 | Numba CUDA FP64 | 4.14 | 161× |
| 5 | Rust Parallel | 12.39 | 54× |
| 6 | Numba @vectorize | 14.86 | 45× |
| 7 | Numba Parallel | 15.55 | 43× |
| 8 | Rust Single | 555.62 | 1.2× |
| 9 | Numba JIT | 558.30 | 1.2× |
| 10 | NumPy Vectorized | 667.30 | 1.0× |
| 11 | Pure Python | ~6,650 | 0.1× |

(Chart: full benchmark results.)

Speedup Visualization

(Chart: speedup of each implementation relative to the NumPy baseline.)

Category Champions

(Chart: the fastest implementation in each category.)

🔬 What I Learned

1. GPU ≫ CPU (when it fits)

  • RTX 3080 Ti: 0.21 ms, a 3,255× speedup over NumPy.
  • For embarrassingly parallel, element‑wise workloads, GPUs are in a different league.
  • The massive parallelism (80 SMs, thousands of cores) completely crushes sequential execution.

2. FP32 ≈ 20× faster than FP64 on consumer GPUs

CUDA FP64: 4.11 ms
CUDA FP32: 0.21 ms   ← 20× faster!
  • Consumer GeForce GPUs have very few FP64 units (≈ 1/32 the throughput of FP32).
  • If your algorithm tolerates single‑precision, use FP32.
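
In practice that just means handing the kernels float32 data and checking that the loss of precision is acceptable for your use case; for example:

import numpy as np

x64 = np.random.default_rng(42).random(20_000_000)    # float64 input
x32 = x64.astype(np.float32)                           # half the memory traffic

y64 = np.sqrt(x64 * x64 + 1.0) * np.sin(x64) + np.cos(0.5 * x64)
y32 = np.sqrt(x32 * x32 + 1.0) * np.sin(x32) + np.cos(0.5 * x32)

# Worst‑case absolute error introduced by dropping to single precision
print(np.max(np.abs(y64 - y32.astype(np.float64))))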

3. Rust ≈ Numba JIT (single‑threaded)

Rust (single‑threaded): 555.62 ms
Numba JIT:             558.30 ms
  • Both compile to LLVM IR and generate almost identical code.
  • The tiny difference is just noise, confirming Numba’s claim: “Feels like Python, behaves like C.”

4. Rust beats Numba in parallel (~20 % faster)

Rust Parallel (Rayon): 12.39 ms
Numba Parallel:       15.55 ms
  • Rayon’s work‑stealing scheduler has lower overhead than Numba’s threading.
  • For CPU‑parallel workloads in production, Rust has a clear edge.

5. We hit the memory‑bandwidth wall

Profiling the FP32 CUDA kernel gave:

Time:       0.21 ms
Bandwidth:  ~777 GB/s achieved
Theoretical: 912 GB/s (RTX 3080 Ti)
Efficiency: 85 %
  • The GPU is running at 85 % of peak memory bandwidth.
  • The cores are largely idle; the bottleneck is moving data in and out of memory.
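
The back‑of‑envelope math behind that bullet (assuming one float32 read and one float32 write per element; the flop count is my rough estimate, not a profiler figure):

n = 20_000_000
kernel_time_s = 0.21e-3                 # measured FP32 kernel time

bytes_moved = n * (4 + 4)               # read one float32, write one float32
bandwidth_gbs = bytes_moved / kernel_time_s / 1e9
print(f"{bandwidth_gbs:.0f} GB/s achieved")   # ≈ 762 GB/s, in line with the ~777 GB/s profiled

flops = n * 10                          # sqrt, sin, cos, a few mults and adds (rough count)
intensity = flops / bytes_moved         # ≈ 1.25 FLOP/byte: firmly memory‑bound
print(f"{intensity:.2f} FLOP/byte of arithmetic intensity")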

Waiting for data: no algorithm can beat physics.

This is the Roofline Model in action:

                    Peak Compute
                         /
                        /
Performance            /
                      /  ←  We're here (memory‑bound)
                     /
                    /
            ──────────────────────
                Memory Bandwidth

For this workload the arithmetic intensity is low (few operations per byte), so we’ve hit the memory‑bandwidth ceiling.

Bottom line

  • GPUs dominate pure data‑parallel work.
  • FP32 is the sweet spot on consumer hardware.
  • Rust holds its own (and even outpaces Numba) on the CPU.
  • Memory bandwidth is the ultimate ceiling for both.

Feel free to clone the repo and run the benchmarks yourself!

🧪 The Code

Below are three self‑contained implementations of the same kernel.

1️⃣ Numba (the hero of the original article)

from numba import njit, prange
import numpy as np

@njit(parallel=True, fastmath=True, cache=True)
def compute_numba_parallel(arr, out):
    """Compute sqrt(x²+1)·sin(x) + cos(0.5·x) element‑wise."""
    n = len(arr)
    for i in prange(n):
        x = arr[i]
        out[i] = np.sqrt(x * x + 1.0) * np.sin(x) + np.cos(0.5 * x)

Just add @njit; the rest is pure NumPy‑style Python.
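
The @vectorize entry from the leaderboard is almost as short. A sketch of that variant (the options are my assumption; the repo's version may differ):

import math
from numba import vectorize

@vectorize(["float64(float64)"], target="parallel")
def compute_vectorized(x):
    # Numba compiles this scalar function into a multi‑threaded NumPy ufunc.
    return math.sqrt(x * x + 1.0) * math.sin(x) + math.cos(0.5 * x)

# Usage: out = compute_vectorized(arr)   # arr is a float64 NumPy array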

2️⃣ Rust (the challenger)

use rayon::prelude::*;

/// Compute sqrt(x²+1)·sin(x) + cos(0.5·x) element‑wise.
pub fn compute_parallel(arr: &[f64], out: &mut [f64]) {
    out.par_iter_mut()
        .zip(arr.par_iter())
        .for_each(|(o, &x)| {
            *o = (x * x + 1.0).sqrt() * x.sin() + (0.5 * x).cos();
        });
}

rayon makes data‑parallelism feel as natural as ordinary iterators.

3️⃣ CUDA C++ (the champion)

#include <cuda_runtime.h>
#include <cmath>

__global__ void compute_fp32(const float *arr, float *out, size_t n) {
    // One thread per element
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float x = arr[idx];
        out[idx] = sqrtf(x * x + 1.0f) * sinf(x) + cosf(0.5f * x);
    }
}

/* Helper to launch the kernel */
void launch_compute(const float *d_arr, float *d_out, size_t n) {
    const int threads_per_block = 256;
    const int blocks = (n + threads_per_block - 1) / threads_per_block;
    compute_fp32<<<blocks, threads_per_block>>>(d_arr, d_out, n);
    cudaDeviceSynchronize();   // error checking omitted for brevity
}

A straightforward kernel that maps one thread to each array element.
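
For completeness, the Numba CUDA variant from the leaderboard stays in Python. A minimal sketch of it (mine, assuming the usual one‑thread‑per‑element launch; the repo's kernel may differ):

from numba import cuda
import math
import numpy as np

@cuda.jit
def compute_cuda(arr, out):
    # One GPU thread per element, mirroring the CUDA C++ kernel above.
    i = cuda.grid(1)
    if i < arr.size:
        x = arr[i]
        out[i] = math.sqrt(x * x + 1.0) * math.sin(x) + math.cos(0.5 * x)

# Hypothetical launch, using FP32 to match the fastest Python entry.
n = 20_000_000
x = np.random.default_rng(42).random(n, dtype=np.float32)
d_x = cuda.to_device(x)
d_out = cuda.device_array_like(d_x)
threads = 256
blocks = (n + threads - 1) // threads
compute_cuda[blocks, threads](d_x, d_out)
result = d_out.copy_to_host()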

References

  • Shreyan Ghosh, “When Time Became a Variable — Notes From My Journey With Numba”, Dev.to.
  • Benchmark suite: github.com/copyleftdev/numba-dojo

Keep experimenting. Keep playing. That’s what computing is for.

What’s your favorite performance‑optimization story? Drop it in the comments!
