# Understanding Backpropagation with Python Examples — Part 3
## Recap (Part 2)

In the previous article we plotted the Sum of Squared Residuals (SSR) against the bias $b_3$ to visualise where the loss is minimal. The pink curve showed the relationship, and the lowest point on that curve corresponds to the optimal bias value.
## Using Gradient Descent to Find the Optimal Bias

Instead of evaluating many bias values manually, we can apply gradient descent. This requires the derivative of the SSR with respect to $b_3$.
### Derivative via the Chain Rule
$$
\frac{\partial \text{SSR}}{\partial b_3} = \sum_{i=1}^{n} \frac{\partial \text{SSR}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial b_3}
$$
For each training example, the contribution to the gradient is:

$$
-2 \times (y_i - \hat{y}_i) \times 1
$$

where $y_i$ is the true target and $\hat{y}_i$ is the predicted value. Because $b_3$ is added directly to the output, the derivative of the prediction with respect to the bias is simply 1.
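To make this concrete, here is a minimal Python sketch of that gradient. The function name `ssr_gradient_wrt_bias` and the plain-list inputs are illustrative choices for this article, not code taken from the earlier parts of the series.

```python
def ssr_gradient_wrt_bias(y_true, y_pred):
    """Derivative of the SSR with respect to the bias b_3.

    Each sample contributes -2 * (y_i - y_hat_i) * 1, because the bias
    is added directly to the prediction (d y_hat_i / d b_3 = 1).
    """
    return sum(-2 * (y - y_hat) for y, y_hat in zip(y_true, y_pred))
```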
### First Iteration ($b_3 = 0$)

With $b_3 = 0$, the model's predictions for the three training samples (observed values $[0, 1, 0]$) are:
| Sample | $y_i$ | $\hat{y}_i$ |
|---|---|---|
| 1 | 0 | -2.6 |
| 2 | 1 | -1.6 |
| 3 | 0 | -2.61 |
The gradient is:
$$
\begin{aligned}
& -2 \times (0 - (-2.6)) \times 1 \\
& + \; -2 \times (1 - (-1.6)) \times 1 \\
& + \; -2 \times (0 - (-2.61)) \times 1 \\
& \approx -15.7
\end{aligned}
$$
Using a learning rate $\alpha = 0.1$:

$$
\text{step size} = \text{gradient} \times \alpha = -15.7 \times 0.1 = -1.57
$$

$$
b_3^{\text{new}} = b_3^{\text{old}} - \text{step size} = 0 - (-1.57) = 1.57
$$
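The same update can be reproduced with the gradient sketch above. The list `pre_activation` is an assumption for illustration: it holds the network's outputs before the bias is added, i.e. the predictions when $b_3 = 0$, read off the table.

```python
y_true = [0, 1, 0]                    # observed targets from the table
pre_activation = [-2.6, -1.6, -2.61]  # assumed pre-bias outputs (predictions at b_3 = 0)

b3, lr = 0.0, 0.1                     # initial bias and learning rate alpha

y_pred = [p + b3 for p in pre_activation]     # [-2.6, -1.6, -2.61]
grad = ssr_gradient_wrt_bias(y_true, y_pred)  # about -15.6 with these rounded values
step = grad * lr                              # about -1.56
b3 = b3 - step                                # about 1.56
```

The small gap between this result and the $-15.7$ / $1.57$ above is only because the tabulated predictions are rounded to two decimals.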
With the updated bias, the new predictions (the green curve in the original plot) already fit the observed values noticeably better.
### Second Iteration ($b_3 = 1.57$)
Predicted values become:
| Sample | $y_i$ | $\hat{y}_i$ |
|---|---|---|
| 1 | 0 | -1.03 |
| 2 | 1 | -0.03 |
| 3 | 0 | -1.04 |
Gradient:
$$
\begin{aligned}
& -2 \times (0 - (-1.03)) \times 1 \\
& + \; -2 \times (1 - (-0.03)) \times 1 \\
& + \; -2 \times (0 - (-1.04)) \times 1 \\
& \approx -6.26
\end{aligned}
$$
Step size:

$$
\text{step size} = -6.26 \times 0.1 = -0.626
$$
Update the bias:

$$
b_3^{\text{new}} = b_3^{\text{old}} - \text{step size} = 1.57 - (-0.626) \approx 2.19
$$
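Continuing the snippet from the first iteration (same `y_true`, `pre_activation`, `lr`, and `ssr_gradient_wrt_bias`), the second update is just a repeat of the same three lines:

```python
# Second pass, starting from the bias produced by the first update (about 1.56).
y_pred = [p + b3 for p in pre_activation]     # roughly [-1.04, -0.04, -1.05]
grad = ssr_gradient_wrt_bias(y_true, y_pred)  # roughly -6.25
step = grad * lr                              # roughly -0.62
b3 = b3 - step                                # roughly 2.19, matching the value above
```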
## Convergence
Repeating the update steps reduces the gradient magnitude. When $b_3 \approx 2.61$ the gradient is close to 0, indicating that the loss has reached its minimum. Thus, $b_3 = 2.61$ is the optimal bias for this simple example.
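As a quick check, the whole procedure can be wrapped in a loop that stops once the gradient is effectively zero. This sketch reuses `ssr_gradient_wrt_bias` and the assumed pre-bias outputs from the snippets above, resetting the bias to 0 first.

```python
y_true = [0, 1, 0]
pre_activation = [-2.6, -1.6, -2.61]  # assumed pre-bias outputs, as before

b3, lr = 0.0, 0.1
for i in range(100):
    y_pred = [p + b3 for p in pre_activation]
    grad = ssr_gradient_wrt_bias(y_true, y_pred)
    if abs(grad) < 1e-6:              # gradient effectively zero: stop
        break
    b3 -= grad * lr                   # step opposite to the gradient

print(f"b3 converged to {b3:.3f} after {i} updates")
```

With these rounded inputs the loop settles at roughly $b_3 \approx 2.60$ after about twenty updates; the $2.61$ quoted above comes from the unrounded predictions, so the difference is just rounding.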
## Takeaway
Gradient descent provides an efficient way to locate the minimum of the SSR curve without exhaustive search. By iteratively adjusting the bias in the direction opposite to the gradient, we quickly converge to the optimal value.
Feel free to experiment with the calculations in a Python notebook to reinforce the concepts.