Understanding Backpropagation with Python Examples — Part 3

Published: January 17, 2026 at 02:54 PM EST
2 min read
Source: Dev.to

Recap (Part 2)

In the previous article we plotted the Sum of Squared Residuals (SSR) against the bias \(b_3\) to visualise where the loss is minimal. The pink curve showed the relationship, and the lowest point on that curve corresponds to the optimal bias value.

Using Gradient Descent to Find the Optimal Bias

Instead of evaluating many bias values manually, we can apply gradient descent. This requires the derivative of the SSR with respect to \(b_3\).

Derivative via the Chain Rule

\[
\frac{\partial \text{SSR}}{\partial b_3} = \sum_{i=1}^{n} \frac{\partial \text{SSR}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial b_3}
\]

For each training example the contribution is:

\[
-2 \times (y_i - \hat{y}_i) \times 1
\]

where \(y_i\) is the true target and \(\hat{y}_i\) is the predicted value (the derivative of the prediction with respect to the bias is 1).
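
This per-example rule translates directly into a few lines of Python. The following is a minimal sketch; the function name `ssr_gradient_wrt_bias` and the use of plain Python lists are my own choices for illustration, not code from the original notebook.

```python
def ssr_gradient_wrt_bias(y_true, y_pred):
    """d(SSR)/d(b_3): sum of -2 * (y_i - y_hat_i) * 1 over all training examples."""
    return sum(-2 * (y - y_hat) for y, y_hat in zip(y_true, y_pred))
```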

First Iteration (\(b_3 = 0\))

With the input vector \([0, 1, 0]\) the model predicts:

| Sample | \(y_i\) | \(\hat{y}_i\) |
|--------|---------|---------------|
| 1      | 0       | -2.6          |
| 2      | 1       | -1.6          |
| 3      | 0       | -2.61         |

The gradient is:

\[
\begin{aligned}
&-2 \times (0 - (-2.6)) \times 1 \\
&\;+\; -2 \times (1 - (-1.6)) \times 1 \\
&\;+\; -2 \times (0 - (-2.61)) \times 1 \\
&= -15.7
\end{aligned}
\]

Using a learning rate \(\alpha = 0.1\):

\[
\text{step size} = \text{gradient} \times \alpha = -15.7 \times 0.1 = -1.57
\]

\[
b_3^{\text{new}} = b_3^{\text{old}} - \text{step size} = 0 - (-1.57) = 1.57
\]

The new predictions (shown as the green curve) already fit the data noticeably better.
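
The whole first iteration can be checked in a few lines. This is a sketch that plugs in the tabulated predictions above; because those values are rounded, the numbers come out slightly different from the article's -15.7 and 1.57.

```python
# First iteration, starting from b_3 = 0.
# Targets and predictions are the (rounded) values from the table above.
y_true = [0, 1, 0]
y_pred = [-2.6, -1.6, -2.61]
learning_rate = 0.1
b3 = 0.0

gradient = sum(-2 * (y - y_hat) for y, y_hat in zip(y_true, y_pred))
step_size = gradient * learning_rate   # gradient * alpha
b3 = b3 - step_size                    # b_3_new = b_3_old - step size

print(gradient, step_size, b3)         # about -15.6, -1.56, 1.56 with these rounded inputs
```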

Second Iteration (\(b_3 = 1.57\))

Predicted values become:

| Sample | \(y_i\) | \(\hat{y}_i\) |
|--------|---------|---------------|
| 1      | 0       | -1.03         |
| 2      | 1       | -0.03         |
| 3      | 0       | -1.04         |

Gradient:

\[
\begin{aligned}
&-2 \times (0 - (-1.03)) \times 1 \\
&\;+\; -2 \times (1 - (-0.03)) \times 1 \\
&\;+\; -2 \times (0 - (-1.04)) \times 1 \\
&= -6.26
\end{aligned}
\]

Step size:

\[
\text{step size} = -6.26 \times 0.1 = -0.626
\]

Update bias:

\[
b_3^{\text{new}} = 1.57 - (-0.626) = 2.19
\]
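
The second iteration can be reproduced the same way if we make one simplifying assumption: that \(b_3\) enters the model purely additively, so each prediction equals the \(b_3 = 0\) output plus the current bias. The name `base_output` and that assumption are mine, used only to regenerate the predictions in the table above.

```python
# Second iteration, starting from the updated bias.
# Assumption: prediction = (network output with b_3 = 0) + b_3,
# using the b_3 = 0 predictions from the first table as the base outputs.
y_true = [0, 1, 0]
base_output = [-2.6, -1.6, -2.61]
b3 = 1.57

y_pred = [o + b3 for o in base_output]                                # [-1.03, -0.03, -1.04]
gradient = sum(-2 * (y - y_hat) for y, y_hat in zip(y_true, y_pred))  # about -6.2
step_size = gradient * 0.1
b3 = b3 - step_size                                                   # about 2.19, as above
```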

Convergence

Repeating the update steps reduces the gradient magnitude. When \(b_3 \approx 2.61\) the gradient is close to 0, indicating that the loss has reached its minimum. Thus, \(b_3 = 2.61\) is the optimal bias for this simple example.
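
To watch the convergence numerically, a small loop can repeat the update until the step size becomes negligible. This sketch reuses the additive-bias assumption from the previous snippet; with the rounded base outputs it settles near 2.60, which agrees with the article's 2.61 up to rounding.

```python
# Repeat the gradient descent update until the step size is negligible.
y_true = [0, 1, 0]
base_output = [-2.6, -1.6, -2.61]      # predictions when b_3 = 0 (see the first table)
learning_rate = 0.1

b3 = 0.0
for iteration in range(1, 101):
    y_pred = [o + b3 for o in base_output]
    gradient = sum(-2 * (y - y_hat) for y, y_hat in zip(y_true, y_pred))
    step_size = gradient * learning_rate
    b3 = b3 - step_size
    if abs(step_size) < 1e-4:          # gradient (and hence step) is close to 0: stop
        break

print(f"b_3 converged to {b3:.2f} after {iteration} iterations")
```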

Takeaway

Gradient descent provides an efficient way to locate the minimum of the SSR curve without exhaustive search. By iteratively adjusting the bias in the direction opposite to the gradient, we quickly converge to the optimal value.

Feel free to experiment with the calculations in a Python notebook to reinforce the concepts.
