Part 7: CUDA Integration with Python
Source: Dev.to
XOR Test
After successfully setting up the Neural Network, I tested it with the XOR operation. XOR is a non‑linear operation, so it’s kind of a “Hello World” for verifying that the network can detect non‑linearity. The test ran smoothly, indicating that the network is ready to take on bigger assignments.
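For context, the XOR check boils down to something like this tiny NumPy sketch (not my library's actual API; the network shape, seed, and hyperparameters are just illustrative):

```python
# Illustrative only: a tiny 2-4-1 sigmoid network trained on the XOR truth table.
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # output layer

lr = 0.5
for _ in range(10_000):
    h = sigmoid(X @ W1 + b1)             # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)  # backprop of the squared error
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(3))  # should end up close to [[0], [1], [1], [0]]
```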
I had to replicate the same thing in Rust, but my mind wouldn’t let me do that. Whenever I tried to write a single line of Rust code, my mind started questioning the following:
- What would happen if I fed it a complex equation?
- What would happen if I ran the linear‑regression data through the neural network?
- What would it do with the logistic‑regression dataset?
- Will it work with CUDA?
- Will the linear‑regression program work with CUDA?
So many unanswered questions. It broke my flow.
The CUDA‑Based Linear Regression Program
As I was already working with CUDA at that point, the lowest‑hanging fruit was to integrate CUDA into both the Python and Rust versions of my library. I switched over to the linear‑regression program to execute it on the GPU.
(Un)surprisingly, this did not work. My CPU‑bound program was giving a very low Root MSE = 7.75, but the GPU version was returning 40.
Debugging
After searching the code line by line, I discovered I was returning a zeroed matrix from the GPU in my linear‑prediction function. I fixed it to return the actual output instead, and the program started behaving normally.
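The actual code is Rust + CUDA, but the class of bug looks like this hypothetical CuPy sketch: the kernel happily writes its predictions into one device buffer while the function returns a different, zero‑filled one.

```python
# Hypothetical illustration (CuPy, not my Rust code) of the zeroed-matrix bug.
import cupy as cp
import numpy as np

predict_kernel = cp.RawKernel(r'''
extern "C" __global__
void predict(const double* x, const double* w, double* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = w[0] * x[i] + w[1];   // y = w0 * x + w1
}
''', 'predict')

def linear_predict(x, w, buggy=True):
    n = x.size
    d_out = cp.zeros(n, dtype=cp.float64)
    predict_kernel(((n + 255) // 256,), (256,), (x, w, d_out, np.int32(n)))
    # The bug: returning a fresh zero buffer instead of the one the kernel filled.
    return cp.zeros(n, dtype=cp.float64) if buggy else d_out

x = cp.linspace(0.0, 1.0, 8)
w = cp.array([2.0, 1.0])
print(linear_predict(x, w, buggy=True))    # all zeros -> wildly inflated RMSE
print(linear_predict(x, w, buggy=False))   # the real predictions
```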
Results:
Total samples: 93
Mean Squared Error: 59.6779
Root MSE: 7.7251
Similarly, I wrote another CUDA program for logistic regression and it also went fine. Unfortunately, I somehow missed capturing the results of the CUDA logistic program.
One question shot down: the Rust CUDA program works with the regression datasets. Let’s move on.
CUDA Integration with Python
Once I was happy with the results, I turned to the Python neural‑network script used for the XOR test dataset. I had already worked with bigger datasets, so running the simple XOR test didn’t feel right. I planned to run the linear‑regression and logistic‑regression datasets on the neural network too; ideally, they should run properly without issue.
I wired the JSON file to the script and started training the neural network.
Oh boy, I just opened another rabbit hole.
With a huge data load, the CPU is not powerful enough to churn through everything sequentially, even though NumPy does some limited parallelisation over the arrays. The email‑spam/ham dataset was the tipping point: my CPU could not handle the load any longer. I looked for a solution in scikit‑learn and found that the library has limited GPU support through CuPy, but there are some setup challenges.
Since I already understood the maths and had a NumPy version of the neural‑network program handy, I switched the NumPy imports to CuPy. Obviously, I would miss a lot of optimisations, but that would give me a push to learn optimisation techniques later.
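The swap itself is almost mechanical; something along these lines (illustrative snippet, not my actual script):

```python
# import numpy as xp        # original CPU version
import cupy as xp           # drop-in GPU version (needs an NVIDIA GPU + CuPy)

# The array code itself doesn't change; it just executes on the GPU now.
a = xp.random.randn(2048, 2048)
b = xp.random.randn(2048, 2048)
c = xp.tanh(a @ b)
print(float(c.sum()))       # converting to a Python float copies one value back to the host
```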
Solution planned: I fired up WSL, installed CuPy, and swapped numpy for cupy in the script. Then another patch of quicksand swallowed me.
The same program that ran in 10 minutes with thousands of iterations in the NumPy version was not even completing 1 000 iterations in 10 minutes in the CuPy version. Thankfully, my earlier Rust experience helped: the bottleneck was host‑to‑device (H2D) and device‑to‑host (D2H) data‑copy overhead.
I revisited the library documentation, found a fix, and applied it. Two adjustments were needed:
- Remove the error calculation at every epoch to avoid the D2H copy.
- Use `synchronize` after a set of operations instead of after each epoch (see the sketch below).
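Put together, the training loop ends up looking roughly like this (a self‑contained sketch with made‑up data, not my actual script):

```python
import cupy as cp

# Toy linear-regression data, just to make the sketch runnable.
rng = cp.random.default_rng(0)
X = rng.standard_normal((10_000, 8))
y = X @ rng.standard_normal((8, 1)) + 0.1 * rng.standard_normal((10_000, 1))

w = cp.zeros((8, 1))
lr, n_epochs, log_every = 0.01, 5_000, 1_000

for epoch in range(n_epochs):
    err = X @ w - y                        # stays entirely on the device
    loss = cp.mean(err ** 2)               # still a CuPy array: no D2H copy here
    w -= lr * (2.0 / len(X)) * (X.T @ err)

    if epoch % log_every == 0:
        cp.cuda.Device().synchronize()     # sync once per batch of epochs, not every epoch
        # float(loss) is the single device-to-host copy we allow per log line.
        print(f"Epoch {epoch} | Error: {float(loss):.6f}")
```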
The Takeaway
After fixing the H2D/D2H copy overhead, the program worked decently: the same script now ran within 10–15 seconds, roughly a 60× speedup.
Honestly, I got addicted to the speed…
After a long day of fighting through setup and debugging, a little playtime was approved. I started experimenting:
- First a single layer (not much load).
- Then I added two more layers, just for fun.
With this, the neural network brought down the MSE of the linear‑regression dataset even lower:
Test MSE: 52.3265 (From earlier 59.6779)
Test MAE: 5.3395
The effort was worth it once again. Seeing the network learn step‑by‑step in each output was satisfying. With the speed boost, I could choose a lower learning rate and a higher epoch count (number of training iterations), as well as tweak network configurations (number of layers, nodes per layer, etc.).
Epoch 1/200000 | Error: 25.926468
Epoch 1000/200000| Error: 0.482236
Epoch 2000/200000| Error: 0.414885
Epoch 3000/200000| Error: 0.377820
Epoch 4000/200000| Error: 0.354329
Epoch 5000/200000| Error: 0.340112
Epoch 6000/200000| Error: 0.331060
Epoch 7000/200000| Error: 0.324392
Epoch 8000/200000| Error: 0.319276
Epoch 9000/200000| Error: 0.315130
Epoch 10000/200000| Error: 0.311793
Epoch 11000/200000| Error: 0.308888
Epoch 12000/200000| Error: 0.306242
Epoch 13000/200000| Error: 0.303405
Epoch 14000/200000| Error: 0.300487
Epoch 15000/200000| Error: 0.298240
Epoch 16000/200000| Error: 0.296392
While looking into the errors, I noticed how gradient descent works: at first the error is large, so the corrections are big and the network improves quickly; as it learns, the error shrinks and the corrections become smaller and smaller.
Experiments with Learning Rate and Network Depth
I tried different learning rates. When I chose a larger learning rate, the neural network failed to converge: it oscillated between two points and finally returned a higher error.
On the other hand, using a smaller learning rate gave me smoother convergence, but it took a very long time to converge.
I also noticed that even with a tiny learning rate of 0.005, the network would almost stabilize after 20 000 epochs, yet it could still show slight oscillations.
Another observation was that running more and more training loops does not change the Mean Absolute Error (MAE) much. At some point, saturation is inevitable.
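The oscillation is easy to reproduce even outside a neural network. A toy example (not from my experiments): gradient descent on f(w) = w², whose gradient is 2w, so each update multiplies w by (1 − 2·lr).

```python
def descend(lr, steps=20, w=1.0):
    # Plain gradient descent on f(w) = w^2; the gradient is 2w.
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(lr=0.05))  # small step: smooth but slow progress towards 0
print(descend(lr=0.45))  # moderate step: converges quickly
print(descend(lr=1.10))  # too large: each step overshoots, so w oscillates and grows
```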
Best result with 2 hidden layers
Training completed in 285.5212 seconds.
Final Results after 100 000 epochs and learning rate 0.001:
Test MSE: 46.2653
Test MAE: 5.2730
Comparable result (fewer epochs, higher learning rate)
Training completed in 117.2746 seconds.
Final Results after 40 000 epochs and learning rate 0.005:
Test MSE: 52.6716
Test MAE: 5.3199
Best result with 1 hidden layer
Starting training...
Training completed in 77.6143 seconds.
Final Results after 40 000 epochs and learning rate 0.005:
Test MSE: 45.5669
Test MAE: 5.1707
By the time I finished all my experiments, 6 hours had passed. These insights helped me understand neural networks a little better than before.
I always doubted the performance of my earlier Logistic Regression program written in Rust. With successful CUDA integration, it was time to do a fair comparison. I implemented logistic regression in the neural network by using a single linear layer followed by a sigmoid layer.
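The set-up for that comparison looks roughly like this in CuPy (a sketch with synthetic data; the dataset, shapes, and hyperparameters here are placeholders, not the ones I actually used):

```python
import cupy as cp

# Synthetic, linearly separable data just to make the sketch runnable.
rng = cp.random.default_rng(0)
X = rng.standard_normal((5_000, 4))
true_w = cp.array([[1.0], [-2.0], [0.5], [3.0]])
y = (X @ true_w > 0).astype(cp.float64)

# Logistic regression = one linear layer followed by a sigmoid.
w, b = cp.zeros((4, 1)), cp.zeros((1,))
lr = 0.1
for _ in range(2_000):
    p = 1.0 / (1.0 + cp.exp(-(X @ w + b)))    # linear layer + sigmoid
    grad_w = X.T @ (p - y) / len(X)           # gradient of binary cross-entropy
    grad_b = cp.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = cp.mean(((p > 0.5) == y).astype(cp.float64))
print(f"train accuracy: {float(accuracy):.3f}")
```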
The result surprised me: my inefficiently written CUDA program was actually faster than my cupy program. I didn’t investigate further; I called it a day.
Quick inventory check
- CPU Linear Regression
- CPU Logistic Regression
- GPU Linear Regression
- GPU Logistic Regression
- Python script to run GPU‑powered neural network
And you know what? This built the perfect foundation for more experimentation. Stay tuned to see what’s next!
