Cross Entropy Derivatives, Part 6: Using gradient descent to reach the final result

Published: (February 8, 2026 at 02:52 PM EST)
3 min read
Source: Dev.to

Source: Dev.to

Optimizing the Bias (b_3) – Getting the Exact Value

In the previous article we plotted a curve that helps us optimise the bias.
In this article we compute the accurate value for (b_3).

Derivative of the Cross‑Entropy Loss w.r.t. (b_3)

Because we have three observations and make one prediction per observation, the total derivative is obtained by summing one term per prediction.

1️⃣ First Observation

We focus on the prediction for the first observation.
The observed species is Setosa, so we compute the cross‑entropy using the predicted probability for Setosa.

The network’s forward‑pass expression (from the previous article) is:

Network forward‑pass

For petal width = 0.04 and sepal width = 0.42, the predicted probability for Setosa is:

Predicted probability (first obs)

Hence the contribution from the first observation to the derivative is

[ \frac{\partial L_1}{\partial b_3}= \hat{y}_1 - y_1 = 0.15 - 1 . ]

2️⃣ Second Observation

The second observation belongs to the species Virginica.
The derivative term for this observation is:

Derivative term (second obs)

Using petal width = 1.0 and sepal width = 0.54, the predicted probability for Setosa is:

Predicted probability (second obs)

3️⃣ Third Observation

The third observation belongs to the species Versicolor, so the derivative term is again:

Derivative term (third obs)

For petal width = 0.50 and sepal width = 0.37, the predicted probability for Setosa is:

Predicted probability (third obs)

📐 Total Derivative

Adding the three contributions:

[ \begin{aligned} \frac{\partial L}{\partial b_3} &= (0.15 - 1) ;+; 0.04 ;+; 0.04 \ &= -0.77 . \end{aligned} ]

This value represents the slope of the tangent line to the loss curve at (b_3 = -2).

🔄 Gradient‑Descent Update

We now plug this slope into the gradient‑descent update rule:

[ b_3^{\text{new}} = b_3^{\text{old}} - \alpha \frac{\partial L}{\partial b_3} ]

If we set the learning rate (\alpha = 1), then

[ b_3^{\text{new}} = -2 - ( -0.77 ) = -2 + 0.77 = -1.23 . ]

The update step is visualised as:

Gradient‑descent update

We repeat this process, using the updated value of (b_3) each time, until one of the stopping criteria is met (predictions stop improving, a maximum number of steps is reached, etc.).

In this example the predictions stop improving when (b_3 = -0.03).

🚀 Looking for an easier way to install tools, libraries, or entire repositories?

Try Installerpedia – a community‑driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

ipm install repo-name

…and you’re done! 🎉


End of article.

0 views
Back to Blog

Related posts

Read more »

Deep Learning Without Backpropagation

!Cover image for Deep Learning Without Backpropagationhttps://media2.dev.to/dynamic/image/width=1000,height=420,fit=cover,gravity=auto,format=auto/https%3A%2F%2...