# Understanding Backpropagation with Python Examples — Part 3
## Recap (Part 2)

In the previous article we plotted the Sum of Squared Residuals (SSR) against the bias $b_3$ to visualise where the loss is minimal. The pink curve showed the relationship, and the lowest point on that curve corresponds to the optimal bias value.
## Using Gradient Descent to Find the Optimal Bias

Instead of evaluating many bias values manually, we can apply gradient descent. This requires the derivative of the SSR with respect to $b_3$.
### Derivative via the Chain Rule
$$
\frac{\partial \text{SSR}}{\partial b_3} = \sum_{i=1}^{n} \frac{\partial \text{SSR}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial b_3}
$$
For each training example, the contribution to the gradient is:

$$
-2 \times (y_i - \hat{y}_i) \times 1
$$

where $y_i$ is the true target and $\hat{y}_i$ is the predicted value. Because $b_3$ is added directly to the output, the derivative of the prediction with respect to the bias is simply 1.
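To make this concrete, here is a minimal Python sketch of that gradient. The function name `ssr_gradient_wrt_bias` and the plain-list inputs are illustrative choices for this article, not code taken from the earlier parts of the series.

```python
def ssr_gradient_wrt_bias(y_true, y_pred):
    """Derivative of the SSR with respect to the bias b_3.

    Each sample contributes -2 * (y_i - y_hat_i) * 1, because the bias
    is added directly to the prediction (d y_hat_i / d b_3 = 1).
    """
    return sum(-2 * (y - y_hat) for y, y_hat in zip(y_true, y_pred))
```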
### First Iteration ($b_3 = 0$)

With $b_3 = 0$, the model's predictions for the three training samples (observed values $[0, 1, 0]$) are:
| Sample | $y_i$ | $\hat{y}_i$ |
|---|---|---|
| 1 | 0 | -2.6 |
| 2 | 1 | -1.6 |
| 3 | 0 | -2.61 |
The gradient is:
$$
\begin{aligned}
& -2 \times (0 - (-2.6)) \times 1 \\
& + \; -2 \times (1 - (-1.6)) \times 1 \\
& + \; -2 \times (0 - (-2.61)) \times 1 \\
& \approx -15.7
\end{aligned}
$$
Using a learning rate $\alpha = 0.1$:

$$
\text{step size} = \text{gradient} \times \alpha = -15.7 \times 0.1 = -1.57
$$

$$
b_3^{\text{new}} = b_3^{\text{old}} - \text{step size} = 0 - (-1.57) = 1.57
$$
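The same update can be reproduced with the gradient sketch above. The list `pre_activation` is an assumption for illustration: it holds the network's outputs before the bias is added, i.e. the predictions when $b_3 = 0$, read off the table.

```python
y_true = [0, 1, 0]                    # observed targets from the table
pre_activation = [-2.6, -1.6, -2.61]  # assumed pre-bias outputs (predictions at b_3 = 0)

b3, lr = 0.0, 0.1                     # initial bias and learning rate alpha

y_pred = [p + b3 for p in pre_activation]     # [-2.6, -1.6, -2.61]
grad = ssr_gradient_wrt_bias(y_true, y_pred)  # about -15.6 with these rounded values
step = grad * lr                              # about -1.56
b3 = b3 - step                                # about 1.56
```

The small gap between this result and the $-15.7$ / $1.57$ above is only because the tabulated predictions are rounded to two decimals.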
With the updated bias, the new predictions (the green curve in the original plot) already fit the observed values noticeably better.
### Second Iteration ($b_3 = 1.57$)
Predicted values become:
| Sample | $y_i$ | $\hat{y}_i$ |
|---|---|---|
| 1 | 0 | -1.03 |
| 2 | 1 | -0.03 |
| 3 | 0 | -1.04 |
Gradient:
$$
\begin{aligned}
& -2 \times (0 - (-1.03)) \times 1 \\
& + \; -2 \times (1 - (-0.03)) \times 1 \\
& + \; -2 \times (0 - (-1.04)) \times 1 \\
& \approx -6.26
\end{aligned}
$$
Step size:

$$
\text{step size} = -6.26 \times 0.1 = -0.626
$$
Update the bias:

$$
b_3^{\text{new}} = b_3^{\text{old}} - \text{step size} = 1.57 - (-0.626) \approx 2.19
$$
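Continuing the snippet from the first iteration (same `y_true`, `pre_activation`, `lr`, and `ssr_gradient_wrt_bias`), the second update is just a repeat of the same three lines:

```python
# Second pass, starting from the bias produced by the first update (about 1.56).
y_pred = [p + b3 for p in pre_activation]     # roughly [-1.04, -0.04, -1.05]
grad = ssr_gradient_wrt_bias(y_true, y_pred)  # roughly -6.25
step = grad * lr                              # roughly -0.62
b3 = b3 - step                                # roughly 2.19, matching the value above
```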
## Convergence
Repeating the update steps reduces the gradient magnitude. When $b_3 \approx 2.61$ the gradient is close to 0, indicating that the loss has reached its minimum. Thus, $b_3 = 2.61$ is the optimal bias for this simple example.
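As a quick check, the whole procedure can be wrapped in a loop that stops once the gradient is effectively zero. This sketch reuses `ssr_gradient_wrt_bias` and the assumed pre-bias outputs from the snippets above, resetting the bias to 0 first.

```python
y_true = [0, 1, 0]
pre_activation = [-2.6, -1.6, -2.61]  # assumed pre-bias outputs, as before

b3, lr = 0.0, 0.1
for i in range(100):
    y_pred = [p + b3 for p in pre_activation]
    grad = ssr_gradient_wrt_bias(y_true, y_pred)
    if abs(grad) < 1e-6:              # gradient effectively zero: stop
        break
    b3 -= grad * lr                   # step opposite to the gradient

print(f"b3 converged to {b3:.3f} after {i} updates")
```

With these rounded inputs the loop settles at roughly $b_3 \approx 2.60$ after about twenty updates; the $2.61$ quoted above comes from the unrounded predictions, so the difference is just rounding.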
## Takeaway
Gradient descent provides an efficient way to locate the minimum of the SSR curve without exhaustive search. By iteratively adjusting the bias in the direction opposite to the gradient, we quickly converge to the optimal value.
Feel free to experiment with the calculations in a Python notebook to reinforce the concepts.