Backprop Finally Made Sense When I Reimplemented It in Rust
Source: Dev.to
Introduction
I never used PyTorch or TensorFlow. My ML background was NumPy and scikit‑learn: I could train models, tune parameters, and get reasonable results, but when it came to explaining why things worked, my understanding was shaky. Backpropagation especially felt like a black box. I knew the steps at a high level, but I didn’t feel them.
So I stopped using ML libraries entirely and rebuilt the core of a neural network from scratch in Rust. That’s when backprop finally made sense.
Why the abstractions hide the learning
The problem wasn’t NumPy or scikit‑learn—they do exactly what they promise. The problem was that they abstract away everything that actually matters for understanding. Once I removed the abstractions (no autograd, just flat buffers, explicit indexing, and hand‑written matrix operations), the mystery disappeared.
Memory layout example
let data = [1, 2, 3, 4, 5, 6];
let shape = (2, 3); // (rows, cols)
// Logical view
// [ 1 2 3 ]
// [ 4 5 6 ]
// Memory view (row‑major)
// [1][2][3][4][5][6]
// 0 1 2 3 4 5
In Rust you can’t “kind of” do a transpose—you have to explain exactly how indices move in memory:
let index = row * cols + col;
That constraint changed everything. You can’t wave at gradients; you have to compute and store them explicitly.
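That one index formula is enough to write a transpose by hand. A minimal sketch (the function name and signature are mine, not the walkthrough's):

```rust
// Transpose a row-major buffer by computing, for every element,
// exactly where it lands in the new (cols, rows) layout.
fn transpose(data: &[f64], rows: usize, cols: usize) -> Vec<f64> {
    let mut out = vec![0.0; data.len()];
    for row in 0..rows {
        for col in 0..cols {
            // Source index in the (rows, cols) layout...
            let src = row * cols + col;
            // ...maps to this flat index in the (cols, rows) layout.
            let dst = col * rows + row;
            out[dst] = data[src];
        }
    }
    out
}
```

Running it on the 2×3 buffer above yields `[1, 4, 2, 5, 3, 6]`: the same six numbers, reordered so that the 3×2 logical view reads correctly.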
What backprop really is
Backprop stopped being mysterious when I had to implement it myself—not symbolically, but as concrete bookkeeping. The process boils down to three repeated actions:
- Applying the chain rule
- Reusing intermediate values from the forward pass
- Pushing gradients backward through matrix operations
Forward and backward passes
Forward pass:
X → [ Linear ] → [ Activation ] → ŷ → Loss
Backward pass:
∂Loss → [ dActivation ] → [ dLinear ] → ∂W, ∂X
When you write this by hand, a few things become painfully clear:
- Gradients don’t “flow” — they are accumulated.
- Shape alignment is the real constraint, not calculus.
- Most bugs stem from incorrect assumptions about dimensions, not from the math itself.
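To make those three points concrete, here is a sketch of the backward pass through a single linear layer y = W·x for one example, with flat buffers and explicit shapes. The names and shapes are illustrative, not the article's actual code:

```rust
// Backward through y = W·x.
// W is (out_dim, in_dim) row-major, x is (in_dim,), dy is (out_dim,).
fn linear_backward(
    w: &[f64], x: &[f64], dy: &[f64],
    out_dim: usize, in_dim: usize,
) -> (Vec<f64>, Vec<f64>) {
    let mut dw = vec![0.0; out_dim * in_dim]; // same shape as W
    let mut dx = vec![0.0; in_dim];           // same shape as x
    for o in 0..out_dim {
        for i in 0..in_dim {
            // Chain rule: dLoss/dW[o][i] = dy[o] * x[i] (outer product),
            // which reuses x, an intermediate value from the forward pass.
            dw[o * in_dim + i] += dy[o] * x[i];
            // dLoss/dx[i] accumulates a contribution from every output.
            dx[i] += dy[o] * w[o * in_dim + i];
        }
    }
    (dw, dx)
}
```

Note the `+=`: every gradient is summed into a pre-shaped buffer, and the only things that can go wrong are the index arithmetic and the shapes.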
Simple computational graph
     ┌── w1 ──┐
X ──►┤        ├──► (+) ──► Loss
     └── w2 ──┘
Backward:
∂Loss/∂X = ∂Loss/∂path₁ + ∂Loss/∂path₂
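With hypothetical numbers, the accumulation rule is a two-line loop body. If Loss = w1·x + w2·x, the gradient reaching x is summed from both paths, never overwritten:

```rust
// dLoss/dx for Loss = w1*x + w2*x: each path contributes its own term,
// and the contributions are accumulated with +=.
fn grad_wrt_x(w1: f64, w2: f64) -> f64 {
    let mut dx = 0.0;
    dx += w1; // contribution of path 1
    dx += w2; // contribution of path 2
    dx
}
```

Overwriting instead of accumulating would silently keep only the last path's gradient, which is exactly the kind of bug that stays invisible behind an autograd library.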
Backprop felt hard before because I never saw where the numbers actually lived.
Rust’s role in the learning process
Rust isn’t important because it’s fast here; it’s important because it’s unforgiving. It forces you to confront:
- How tensors are laid out in memory
- When data is copied vs. reused
- Which operations allocate new buffers
- Which gradients depend on which forward values
I avoided third‑party crates on purpose and used only the standard library. The goal wasn’t elegance or performance—it was transparency. If something worked, I wanted to be able to explain why it worked at the level of indices and buffers.
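As one small illustration of the copy-vs-reuse point (my own example, not from the guide): a borrowed slice reuses the caller's buffer, while collecting into a `Vec` allocates a fresh one, and the function signatures make the difference visible.

```rust
// Borrows the buffer: no allocation, no copy.
fn sum(buf: &[f64]) -> f64 {
    buf.iter().sum()
}

// Allocates and returns a new buffer; the input is untouched.
fn scaled(buf: &[f64], k: f64) -> Vec<f64> {
    buf.iter().map(|v| v * k).collect()
}
```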
Step‑by‑step implementation
- A tensor type backed by a flat buffer
- Element‑wise operations
- Transpose, reduction, and matrix multiplication
- Linear regression
- Backpropagation and gradient updates
- A small neural network trained end‑to‑end
Nothing is optimized. Everything is explicit. This is not a framework.
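A minimal sketch of what the first two steps might look like, assuming a 2‑D row‑major layout (the guide's actual type may differ):

```rust
// A tensor backed by a flat buffer plus an explicit shape.
struct Tensor {
    data: Vec<f64>,
    rows: usize,
    cols: usize,
}

impl Tensor {
    fn zeros(rows: usize, cols: usize) -> Self {
        Tensor { data: vec![0.0; rows * cols], rows, cols }
    }

    fn get(&self, row: usize, col: usize) -> f64 {
        self.data[row * self.cols + col]
    }

    // Matrix multiplication written out index by index: nothing hidden.
    fn matmul(&self, other: &Tensor) -> Tensor {
        assert_eq!(self.cols, other.rows, "shape mismatch");
        let mut out = Tensor::zeros(self.rows, other.cols);
        for i in 0..self.rows {
            for j in 0..other.cols {
                let mut sum = 0.0;
                for k in 0..self.cols {
                    sum += self.get(i, k) * other.get(k, j);
                }
                out.data[i * out.cols + j] = sum;
            }
        }
        out
    }
}
```

Everything later in the list (linear regression, backprop, the network) is built by composing operations of exactly this shape.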
Who should try this
- Software developers who want to understand neural networks beyond high‑level APIs
- Readers learning Rust who want a demanding, systems‑oriented project
If backprop still feels like something you “accept” rather than understand, rebuilding it once is worth the time.
Further reading
I documented the entire process as a chapter‑style guide, starting from tensors in memory and ending with a working neural network. You can read the full walkthrough here: