Part 9: Generating Simba Network with Rust

Published: January 1, 2026 at 06:56 PM EST
8 min read
Source: Dev.to

Image → CSV Encoder / Decoder

I found a black‑and‑white drawing of a lion cub and wrote a small script to convert the pixels to CSV numbers (and back again).

Once the CSV was generated, I fed it to the neural network with my usual Python training script.
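
As a rough illustration of the idea in Rust (not the script I actually used), the conversion might look like the sketch below, using the image crate; the one‑row‑per‑pixel layout of normalized x, y, and intensity is an assumption:

// Hypothetical sketch of the image-to-CSV encoder, not the original script.
// Assumes one CSV row per pixel: normalized x, normalized y, intensity in [0, 1].
use image::Luma;
use std::{error::Error, fs::File, io::Write};

fn encode_to_csv(image_path: &str, csv_path: &str) -> Result<(), Box<dyn Error>> {
    let img = image::open(image_path)?.to_luma8(); // grayscale copy of the drawing
    let (w, h) = img.dimensions();
    let mut out = File::create(csv_path)?;
    for y in 0..h {
        for x in 0..w {
            let intensity = img.get_pixel(x, y).0[0] as f32 / 255.0;
            // One training sample per pixel: input (x, y), target intensity.
            writeln!(out, "{},{},{}", x as f32 / w as f32, y as f32 / h as f32, intensity)?;
        }
    }
    Ok(())
}

// The decoder simply reverses the mapping: rows of (x, y, intensity) back to pixels.
fn decode_to_image(rows: &[(f32, f32, f32)], w: u32, h: u32, out_path: &str) -> Result<(), Box<dyn Error>> {
    let mut img = image::GrayImage::new(w, h);
    for &(nx, ny, v) in rows {
        let x = ((nx * w as f32) as u32).min(w - 1);
        let y = ((ny * h as f32) as u32).min(h - 1);
        img.put_pixel(x, y, Luma([(v * 255.0) as u8]));
    }
    img.save(out_path)?;
    Ok(())
}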

Helping the Machine to Draw

The script struggled at several points, so I added a few fixes to help the network learn how to draw Simba.

The Large‑Image Issue

  • Original size: 474 × 474 px – caused training to be unbearably slow.
  • Adjusted size: 200 × 200 px – keeps training time reasonable.

The Machine Crash & Restart

Occasional overheating caused the machine to crash, and training would restart from scratch each time—a huge waste of time and resources.

Solution: Added a save / load checkpoint mechanism so training can resume from the last saved model.
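
The checkpoint itself went into the Python training script, but the same idea carries over directly to the Rust port later in this post. A minimal sketch, assuming the weights live in a flat Vec<f32> (the file format here is mine):

use std::{fs, io, path::Path};

/// Write flattened network weights to a checkpoint file as little-endian f32 bytes.
fn save_checkpoint(path: &str, weights: &[f32]) -> io::Result<()> {
    let mut bytes = Vec::with_capacity(weights.len() * 4);
    for w in weights {
        bytes.extend_from_slice(&w.to_le_bytes());
    }
    fs::write(path, bytes)
}

/// Load the last checkpoint, or return None so training starts from scratch.
fn load_checkpoint(path: &str) -> io::Result<Option<Vec<f32>>> {
    if !Path::new(path).exists() {
        return Ok(None);
    }
    let bytes = fs::read(path)?;
    let weights = bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Ok(Some(weights))
}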

The Error‑Oscillation Problem

No matter how small the learning rate, the loss got stuck in an oscillation loop. I tried an extremely tiny learning rate (e.g., 0.000001), but training became impractically slow.

Instead, I implemented cosine annealing to gradually decrease the learning rate:

# Cosine annealing learning‑rate schedule
decay_factor = 0.5 * (1 + math.cos(math.pi * i / (epochs + epoch_offset)))
current_lr   = lr_min + (lr_max - lr_min) * decay_factor

This produced smooth, steady learning.

The Result: Art Through Math

After fixing the issues and running the network for ≈ 1.2 M iterations (about 4 h on my machine), the generated output was very close to the input image.

The input represents a highly complex, high‑dimensional function—far beyond a simple XOR gate or logistic‑regression dataset. This experiment demonstrates the Universal Approximation Theorem: a neural network can represent (almost) any continuous function.

Original Image

Initial Static (Random Weights)

Final Reconstruction (200 × 200)

Reconstruction on Higher Resolutions

Using the learned weights, the network was tested on various canvas sizes (50 × 50, 512 × 512, 1024 × 1024, etc.). It consistently reproduced the image, indicating that it learned the underlying function rather than memorising individual pixels.

1024 × 1024 Reconstruction

Because the network learned the mathematical concept of the lines, the 1024 × 1024 version isn’t pixelated like a naïve zoom; it looks as if the network is redrawing the masterpiece on a larger canvas.
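
Rendering at a new resolution is just sampling the learned function on a denser grid of normalized coordinates. A minimal sketch of that loop, where predict stands in for a forward pass through the trained network:

/// Render the learned image function on a canvas of arbitrary size.
/// `predict` is a hypothetical stand-in for the trained network's forward pass,
/// mapping normalized (x, y) coordinates to an intensity in [0, 1].
fn render<F: Fn(f32, f32) -> f32>(width: u32, height: u32, predict: F) -> Vec<u8> {
    let mut pixels = Vec::with_capacity((width * height) as usize);
    for y in 0..height {
        for x in 0..width {
            // The same normalized coordinates work at 50 x 50 or 1024 x 1024,
            // because the network models a continuous function, not stored pixels.
            let nx = x as f32 / width as f32;
            let ny = y as f32 / height as f32;
            pixels.push((predict(nx, ny).clamp(0.0, 1.0) * 255.0) as u8);
        }
    }
    pixels
}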

Validation & Next Steps

I built a highly complex, inefficient, and expensive image‑scaler. The result is satisfying but not perfect—it proves the point, but I want higher fidelity.

After posting the result on Reddit, a user suggested trying SIREN (Sinusoidal Representation Networks).

  • SIREN uses sine activations instead of ReLU/Tanh.
  • It excels at implicit neural representation, a technique very similar to what I was attempting.

I have now implemented a SIREN in Python with the hope of achieving even better reconstructions. Stay tuned for the next post!
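
The details are for the next post, but the core idea is easy to sketch: each layer computes sin(omega_0 * (W * x + b)), with the omega_0 = 30 scaling from the SIREN paper. Shapes and names below are illustrative, not my actual implementation:

/// One SIREN layer: output = sin(omega_0 * (W * input + b)).
struct SirenLayer {
    weights: Vec<f32>, // out_dim * in_dim, row-major
    bias: Vec<f32>,    // out_dim
    in_dim: usize,
    out_dim: usize,
    omega_0: f32,      // the SIREN paper uses 30.0 for the first layer
}

impl SirenLayer {
    fn forward(&self, input: &[f32]) -> Vec<f32> {
        assert_eq!(input.len(), self.in_dim);
        (0..self.out_dim)
            .map(|o| {
                let row = &self.weights[o * self.in_dim..(o + 1) * self.in_dim];
                let pre = row.iter().zip(input).map(|(w, x)| w * x).sum::<f32>() + self.bias[o];
                (self.omega_0 * pre).sin() // sine activation instead of ReLU/Tanh
            })
            .collect()
    }
}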

The Rust Comeback

The Python prototype showed fruitful results, reigniting my enthusiasm to tackle the problem in Rust. After a brief pause, I spent another week preparing the Rust implementation (a rough sketch of the layout follows the list):

  • Created a Tensor trait and moved all defined methods into it.
  • Implemented a CpuTensor struct with the Tensor trait.
  • Implemented a GpuTensor struct with the Tensor trait.
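
Roughly, the layout looks like the sketch below (method names are illustrative, not the exact trait):

// Illustrative sketch of the split; the real trait defines many more methods.
pub trait Tensor: Sized {
    fn zeros(shape: &[usize]) -> Self;
    fn ones(shape: &[usize]) -> Self;
    fn matmul(&self, rhs: &Self) -> Self;
    fn shape(&self) -> &[usize];
}

// CPU-backed tensor: plain Vec<f32> storage, computed with ordinary loops.
pub struct CpuTensor {
    data: Vec<f32>,
    shape: Vec<usize>,
}

// GPU-backed tensor: the data lives in a CUDA device buffer.
pub struct GpuTensor {
    device_buffer: cust::memory::DeviceBuffer<f32>,
    shape: Vec<usize>,
}

// Both structs then implement the same trait, so the training loop is written
// once against `impl Tensor` and the backend can be swapped freely.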

The Initial Shock

The GPU tensor took 90+ seconds to run the same network that completed in 8 seconds in Python. Profiling with Nsight Systems (nsys) revealed that most of the runtime was spent in memory allocation and deallocation, not in CUDA kernels.

CuPy avoids this overhead by using a custom memory‑pool implementation. I needed a similar solution for Rust. The cust crate does not expose a direct pool API, but the CUDA driver provides one, re‑exported under cust::sys. After considerable trial‑and‑error, I integrated a working memory‑pool helper.

Memory‑Pool Helper (Rust)

pub fn get_mem_pool() -> CudaMemoryPool {
    // Get the first CUDA device
    let device = Device::get_device(0).unwrap();

    // Create a memory pool for the device
    let mut pool = std::ptr::null_mut();
    let pool_props = CUmemPoolProps {
        allocType: cust::sys::CUmemAllocationType::CU_MEM_ALLOCATION_TYPE_PINNED,
        handleTypes: cust::sys::CUmemAllocationHandleType::CU_MEM_HANDLE_TYPE_NONE,
        location: cust::sys::CUmemLocation {
            type_: cust::sys::CUmemLocationType_enum::CU_MEM_LOCATION_TYPE_DEVICE,
            id: 0,
        },
        win32SecurityAttributes: std::ptr::null_mut(),
        reserved: [0u8; 64],
    };

    // Create the pool (unsafe because it calls the CUDA driver API)
    unsafe { cuMemPoolCreate(&mut pool, &pool_props) };

    // Reserve a large chunk of memory once, then return it to the pool.
    let reserve_size: usize = 2048 * 1024 * 1024; // 2 GiB
    let mut reserve_ptr: CUdeviceptr = 0;
    unsafe {
        // Allocate from the pool (synchronous the first time)
        cuMemAllocFromPoolAsync(
            &mut reserve_ptr,
            reserve_size,
            pool,
            std::ptr::null_mut(),
        );
        cuStreamSynchronize(std::ptr::null_mut());

        // Immediately free it back to the pool for reuse
        cuMemFreeAsync(reserve_ptr, std::ptr::null_mut());
        cuStreamSynchronize(std::ptr::null_mut());
    }

    println!("Memory pool created for device {}", device.name().unwrap());

    CudaMemoryPool {
        pool: Arc::new(Mutex::new(UnsafeCudaMemPoolHandle(pool))),
    }
}

Running this helper in a simple main program showed thousands of memory blocks being allocated and deallocated in milliseconds. The one‑time pool creation cost is negligible compared to the speed gains on subsequent allocations.
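
The test was essentially a tight allocate/free loop against the pool, along these lines (it reuses the cust::sys driver calls from the helper above; the block size and iteration count are arbitrary):

use cust::sys::{cuMemAllocFromPoolAsync, cuMemFreeAsync, cuStreamSynchronize, CUdeviceptr, CUmemoryPool};
use std::time::Instant;

// Allocate and free many small blocks through the pool; error handling omitted.
fn stress_test(pool: CUmemoryPool) {
    let start = Instant::now();
    unsafe {
        for _ in 0..10_000 {
            let mut ptr: CUdeviceptr = 0;
            // Served from the pre-reserved pool, not a fresh driver allocation each time.
            cuMemAllocFromPoolAsync(&mut ptr, 4 * 1024, pool, std::ptr::null_mut());
            cuMemFreeAsync(ptr, std::ptr::null_mut());
        }
        cuStreamSynchronize(std::ptr::null_mut());
    }
    println!("10_000 alloc/free cycles took {:?}", start.elapsed());
}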

Integration in GpuTensor

After adding the helper, I kept the pool alive using Arc<…> and wrapped all device allocations to go through it. Initially the performance issue persisted, so I investigated further.

Copy, Paste… And… Compiler Error

The pool needed to be stored globally, similar to the CUDA context, to survive for the program’s lifetime. I also discovered that the CUDA wrappers themselves were still using the default allocator, so I had to replace those calls with pool‑based allocations.
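
The GPU_CONTEXT used below is that global. A minimal sketch of how it can be kept alive for the program's lifetime with std::sync::OnceLock (struct and function names here are mine, not necessarily the real ones):

use std::sync::{Arc, OnceLock};

/// Process-wide GPU state that must outlive every tensor and every Drop impl.
/// Field and type names are illustrative; the real struct also keeps the CUDA context alive.
pub struct GpuContext {
    pub pool: Option<Arc<CudaMemoryPool>>, // the pool returned by get_mem_pool()
}

// The pool wraps a raw CUDA handle, so thread-safety is asserted manually;
// every access to the handle goes through the Mutex inside CudaMemoryPool.
unsafe impl Send for GpuContext {}
unsafe impl Sync for GpuContext {}

/// Set once at startup, read from anywhere (including Drop implementations).
pub static GPU_CONTEXT: OnceLock<GpuContext> = OnceLock::new();

pub fn init_gpu_context() {
    let pool = Arc::new(get_mem_pool());
    if GPU_CONTEXT.set(GpuContext { pool: Some(pool) }).is_err() {
        panic!("GPU context already initialized");
    }
}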

Handling raw pointers became necessary. Below is a trimmed example of a custom device buffer that links host memory and device memory:

impl<T> Drop for CustomDeviceBuffer<T> {
    fn drop(&mut self) {
        let pool = match GPU_CONTEXT.get() {
            // …
        };
        // …
    }
}

(Implementation details omitted for brevity.)

Reflections

  • SIREN vs. Sigmoid – Switching to sine activations gave sharper reconstructions, but came with a steep learning curve.
  • Python → Rust – The Python prototype built confidence; Rust exposed performance pitfalls, especially around memory management, prompting a deep dive into CUDA internals.
  • Memory Pools – Using CUDA’s memory‑pool API via cust::sys dramatically reduced allocation overhead. Keeping the pool alive (Arc<…>) and routing all device allocations through it was key.
  • Raw Pointers – Safe handling is possible with careful Drop implementations and proper synchronization.

Example: Obtaining the CUDA Pool from the Global Context (Rust)

// Retrieve the pool (panics if the context or pool is missing)
let pool = GPU_CONTEXT
    .get()
    .expect("No GPU context set")
    .pool
    .as_ref()
    .expect("CUDA not initialized or GPU pool not set");

Example: Freeing a Device Pointer (Rust)

// `pool` is the `Arc<MemoryPool>` obtained above
let _ = pool.free(self.as_device_ptr().as_raw());

Example: Allocating a Custom Device Buffer Through the Pool (Rust)

use cust::{
    memory::{DeviceBuffer, DevicePointer},
    prelude::*,
};
use std::{
    mem::size_of,
    sync::Arc,
};

/// Wrapper around a `DeviceBuffer` that is always allocated from the CUDA memory pool.
pub struct CustomDeviceBuffer<T> {
    pub device_buffer: DeviceBuffer<T>,
}

impl<T> CustomDeviceBuffer<T> {
    /// Allocate a buffer of `size` elements using the global CUDA memory pool.
    pub fn get_device_buffer(size: usize) -> Self {
        // 1️⃣ Grab the pool (panic if it isn’t available)
        let pool: Arc<cust::memory::MemoryPool> = GPU_CONTEXT
            .get()
            .expect("No GPU context set")
            .pool
            .as_ref()
            .expect("CUDA not initialized or GPU pool not set")
            .clone();

        // 2️⃣ Compute the byte size, checking for overflow
        let byte_size = size
            .checked_mul(size_of::<T>())
            .expect("Requested allocation size overflowed");

        if byte_size == 0 {
            panic!("Attempted to allocate a zero‑size buffer");
        }

        // 3️⃣ Allocate raw memory from the pool
        let raw_ptr = pool
            .allocate(byte_size)
            .expect("CUDA memory‑pool allocation failed");

        // 4️⃣ Turn the raw pointer into a `DevicePointer<T>`
        let dev_ptr = unsafe { DevicePointer::from_raw(raw_ptr as *mut T) };

        // 5️⃣ Build a `DeviceBuffer<T>` from the raw parts (unsafe but safe here)
        let device_buffer = unsafe { DeviceBuffer::from_raw_parts(dev_ptr, size) };

        // 6️⃣ Return the wrapped buffer
        Self { device_buffer }
    }
}

Key points

  1. Safety – The only unsafe blocks are the conversions from raw pointers to DevicePointer and from raw parts to DeviceBuffer. All pre‑conditions (non‑zero size, correct alignment, successful allocation) are verified beforehand.
  2. Memory‑pool usage – Allocation is performed via pool.allocate, guaranteeing that the buffer participates in the pool’s reuse strategy.
  3. Error handling – expect is used for brevity; production code would return a Result<_, cust::error::CudaError> instead of panicking.

The Raw Pointer Size Issue

What happened

  • I allocated memory based only on the length of the array.
  • I forgot to multiply by the byte width of the element type.

For an array of f32 with length 1, I allocated 1 byte instead of the required 4 bytes.

Fix

// Incorrect allocation (length only)
let size = length; // → 1 byte for f32[1]

// Correct allocation (length × element size)
let size = length * std::mem::size_of::<f32>(); // → 4 bytes for f32[1]

After correcting the allocation, the code compiled and ran successfully, though execution time did not improve.

The Hidden Bug

Tensor::ones was recreating a Vec of ones and copying it to the device at every epoch. I replaced this with a GPU kernel that fills memory directly:

extern "C" __global__ void fill_value(float *out, int n, float value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = value;
    }
}
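
From the Rust side, the launch looks roughly like this with cust (the PTX path is a placeholder, and in a real tensor library the module and stream would be created once and cached rather than per call):

use cust::{launch, memory::DeviceBuffer, module::Module, stream::{Stream, StreamFlags}};

/// Fill an existing device buffer with `value` without building a host-side Vec.
/// Sketch only: the PTX path is hypothetical and errors are bubbled up unchanged.
fn fill_on_device(buf: &mut DeviceBuffer<f32>, value: f32) -> Result<(), cust::error::CudaError> {
    static PTX: &str = include_str!("../kernels/fill_value.ptx"); // hypothetical path
    let module = Module::from_ptx(PTX, &[])?;
    let kernel = module.get_function("fill_value")?;
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    let n = buf.len() as u32;
    let block = 256u32;
    let grid = (n + block - 1) / block; // enough blocks to cover every element

    unsafe {
        launch!(kernel<<<grid, block, 0, stream>>>(buf.as_device_ptr(), n as i32, value))?;
    }
    stream.synchronize()
}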

cuBLAS Wrapper (Rust)

pub fn matmul(&self, rhs: &Self) -> Result<Self, cust::error::CudaError> {
    let m = self.shape[0] as i32;
    let k = self.shape[1] as i32;
    let n = rhs.shape[1] as i32;

    let total_elements = (m * n) as usize;
    let result = get_device_buffer(total_elements);

    let alpha = T::one();
    let beta  = T::zero();

    unsafe {
        cublasSgemm_v2(
            Self::_get_cublas_handle(),
            cublasOperation_t::CUBLAS_OP_N,
            cublasOperation_t::CUBLAS_OP_N,
            n,
            m,
            k,
            &alpha.f32(),
            rhs.device_buffer.as_device_ptr().as_raw() as *const f32,
            n,
            self.device_buffer.as_device_ptr().as_raw() as *const f32,
            k,
            &beta.f32(),
            result.as_device_ptr().as_raw() as *mut f32,
            n,
        );
    }

    let result_shape = vec![self.shape[0], rhs.shape[1]];
    Ok(Self::_with_device_buffer(result_shape, result))
}

The cuBLAS version did not yield a speed‑up because the matrix sizes were modest; cuBLAS shines with very large matrices.

Conclusion

The journey was long and full of setbacks, but each obstacle taught me valuable lessons about neural‑network design and low‑level GPU programming in Rust. I now have a functional Rust Tensor library that leverages CUDA memory pools, custom device buffers, and (optionally) SIREN networks for high‑quality image reconstruction.

Note: The lack of speed‑up on the tiny XOR test dataset is not a concern here; the solution works, and it will definitely pay off when I tackle the much larger image‑reconstruction task.
