Understanding Backpropagation with Python Examples — Part 3

Published: January 17, 2026 at 02:54 PM EST
2 min read
Source: Dev.to

Recap (Part 2)

In the previous article we plotted the Sum of Squared Residuals (SSR) against the bias \(b_3\) to visualise where the loss is minimal. The pink curve showed the relationship, and the lowest point on that curve corresponds to the optimal bias value.

Using Gradient Descent to Find the Optimal Bias

Instead of evaluating many bias values manually, we can apply gradient descent. This requires the derivative of the SSR with respect to \(b_3\).

Derivative via the Chain Rule

\[
\frac{\partial \text{SSR}}{\partial b_3} = \sum_{i=1}^{n} \frac{\partial \text{SSR}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial b_3}
\]

For each training example the contribution is:

\[
-2 \times (y_i - \hat{y}_i) \times 1
\]

where \(y_i\) is the true target and \(\hat{y}_i\) is the predicted value (the derivative of the prediction with respect to the bias is 1).
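The per-example contribution above adds up over the training set, which can be sketched as a small Python helper (the function name is my own, not from the article):

```python
def d_ssr_d_b3(y_true, y_pred):
    """Derivative of the SSR with respect to the bias b_3.

    Each example contributes -2 * (y_i - y_hat_i) * 1, since the
    prediction's derivative with respect to the bias is 1.
    """
    return sum(-2 * (y - y_hat) for y, y_hat in zip(y_true, y_pred))
```

With the first-iteration predictions, `d_ssr_d_b3([0, 1, 0], [-2.6, -1.6, -2.61])` returns about \(-15.6\); the article's \(-15.7\) comes from the unrounded predictions.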

First Iteration (\(b_3 = 0\))

With the input vector \([0, 1, 0]\) the model predicts:

| Sample | \(y_i\) | \(\hat{y}_i\) |
| ------ | ------- | ------------- |
| 1      | 0       | -2.6          |
| 2      | 1       | -1.6          |
| 3      | 0       | -2.61         |

The gradient is:

\[
\begin{aligned}
& -2 \times (0 - (-2.6)) \times 1 \\
& \;+\; -2 \times (1 - (-1.6)) \times 1 \\
& \;+\; -2 \times (0 - (-2.61)) \times 1 \\
&= -15.7
\end{aligned}
\]

(The tabulated predictions are rounded for display; summing the unrounded contributions gives \(-15.7\).)

Using a learning rate \(\alpha = 0.1\):

\[
\text{step size} = \text{gradient} \times \alpha = -15.7 \times 0.1 = -1.57
\]

\[
b_3^{\text{new}} = b_3^{\text{old}} - \text{step size} = 0 - (-1.57) = 1.57
\]
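The first update step can be sketched directly in Python (variable names are illustrative, and the rounded table values are used as inputs):

```python
# One gradient-descent step on the bias b_3.
y_true = [0, 1, 0]            # targets
y_pred = [-2.6, -1.6, -2.61]  # predictions at b_3 = 0 (rounded)
alpha = 0.1                   # learning rate

gradient = sum(-2 * (y - y_hat) for y, y_hat in zip(y_true, y_pred))
step_size = gradient * alpha
b3_new = 0.0 - step_size
```

With the rounded inputs this yields a gradient of about \(-15.6\) and a new bias of about \(1.56\); the article's \(-15.7\) and \(1.57\) come from the unrounded predictions.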

Re-plotting the predictions with the updated bias (the green curve) shows a visibly improved fit.

Second Iteration (\(b_3 = 1.57\))

Predicted values become:

| Sample | \(y_i\) | \(\hat{y}_i\) |
| ------ | ------- | ------------- |
| 1      | 0       | -1.03         |
| 2      | 1       | -0.03         |
| 3      | 0       | -1.04         |

Gradient:

\[
\begin{aligned}
& -2 \times (0 - (-1.03)) \times 1 \\
& \;+\; -2 \times (1 - (-0.03)) \times 1 \\
& \;+\; -2 \times (0 - (-1.04)) \times 1 \\
&= -6.26
\end{aligned}
\]

Step size:

\[
\text{step size} = -6.26 \times 0.1 = -0.626
\]

Update bias:

\[
b_3^{\text{new}} = 1.57 - (-0.626) = 2.196
\]

Convergence

Repeating the update steps reduces the gradient magnitude. When \(b_3 \approx 2.61\) the gradient is close to 0, indicating that the loss has reached its minimum. Thus, \(b_3 = 2.61\) is the optimal bias for this simple example.
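The full iteration can be sketched as a loop, assuming (as in the worked steps above) that each prediction is the network's output at \(b_3 = 0\) plus the current bias; the base outputs here are the rounded values from the first table:

```python
# Gradient descent on the bias b_3 until the gradient is near zero.
# Assumes predictions shift one-for-one with b_3 (derivative = 1).
y_true = [0, 1, 0]
base = [-2.6, -1.6, -2.61]  # network output before adding b_3 (rounded)
alpha = 0.1
b3 = 0.0

for step in range(100):
    y_pred = [o + b3 for o in base]
    gradient = sum(-2 * (y - y_hat) for y, y_hat in zip(y_true, y_pred))
    if abs(gradient) < 1e-6:  # close enough to the minimum
        break
    b3 -= alpha * gradient
```

With the rounded base values this converges to roughly \(2.60\); the article's \(2.61\) comes from the unrounded outputs. Note that because the step size shrinks with the gradient, the updates slow down automatically as the minimum is approached.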

Takeaway

Gradient descent provides an efficient way to locate the minimum of the SSR curve without exhaustive search. By iteratively adjusting the bias in the direction opposite to the gradient, we quickly converge to the optimal value.

Feel free to experiment with the calculations in a Python notebook to reinforce the concepts.
