Cross Entropy Derivatives, Part 6: Using gradient descent to reach the final result

Published: February 8, 2026 at 02:52 PM EST

Source: Dev.to

Optimizing the Bias \(b_3\) – Getting the Exact Value

In the previous article we plotted a curve that helps us optimise the bias.
In this article we use gradient descent to compute the exact value of \(b_3\).

Derivative of the Cross‑Entropy Loss w.r.t. \(b_3\)

Because we have three observations and make one prediction per observation, the total derivative is obtained by summing one term per prediction.
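Written out, the sum looks like the sketch below. The index \(i\) and the symbol \(L_i\) (the cross-entropy loss for observation \(i\)) are notation assumed here for illustration.

\[
\frac{\partial L}{\partial b_3}
  = \sum_{i=1}^{3} \frac{\partial L_i}{\partial b_3}
  = \frac{\partial L_1}{\partial b_3}
  + \frac{\partial L_2}{\partial b_3}
  + \frac{\partial L_3}{\partial b_3}
\]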

1️⃣ First Observation

We focus on the prediction for the first observation.
The observed species is Setosa, so we compute the cross‑entropy using the predicted probability for Setosa.

The network’s forward‑pass expression (from the previous article) is:

Network forward‑pass

For petal width = 0.04 and sepal width = 0.42, the predicted probability for Setosa is:

Predicted probability (first obs)

Hence the contribution from the first observation to the derivative is

\[ \frac{\partial L_1}{\partial b_3} = \hat{y}_1 - y_1 = 0.15 - 1 = -0.85 . \]
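The "predicted minus observed" form can be checked numerically. The sketch below is not the article's network: the raw output values are made up for illustration, and `b3` is assumed to be added to the Setosa raw output before softmax. It compares a central finite-difference slope of the cross-entropy with the predicted Setosa probability minus 1.

```python
import math

def softmax(raw):
    """Convert raw output values to probabilities that sum to 1."""
    largest = max(raw)
    exps = [math.exp(r - largest) for r in raw]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def setosa_probability(b3, raw_setosa_without_bias, raw_others):
    """Predicted probability for Setosa when b3 is added to its raw output."""
    raw = [raw_setosa_without_bias + b3] + list(raw_others)
    return softmax(raw)[0]

def cross_entropy_setosa(b3, raw_setosa_without_bias, raw_others):
    """Cross-entropy when the observed species is Setosa: -log(p_setosa)."""
    return -math.log(setosa_probability(b3, raw_setosa_without_bias, raw_others))

# Hypothetical raw outputs, NOT taken from the article's network.
RAW_SETOSA_WITHOUT_BIAS = 1.0
RAW_OTHERS = (0.5, 0.2)  # raw outputs for the other two species (assumed)
B3 = -2.0
EPS = 1e-6

# Analytic slope: predicted probability minus the observed value (1 for Setosa).
analytic = setosa_probability(B3, RAW_SETOSA_WITHOUT_BIAS, RAW_OTHERS) - 1.0

# Numeric slope: central finite difference of the loss around B3.
numeric = (
    cross_entropy_setosa(B3 + EPS, RAW_SETOSA_WITHOUT_BIAS, RAW_OTHERS)
    - cross_entropy_setosa(B3 - EPS, RAW_SETOSA_WITHOUT_BIAS, RAW_OTHERS)
) / (2 * EPS)

print(f"analytic: {analytic:.6f}, numeric: {numeric:.6f}")
```

The two numbers should agree to several decimal places, which is what we expect if the derivative really is the predicted probability minus 1.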

2️⃣ Second Observation

The second observation belongs to the species Virginica.
The derivative term for this observation is:

Derivative term (second obs)

Using petal width = 1.0 and sepal width = 0.54, the predicted probability for Setosa is:

Predicted probability (second obs)

The result is 0.04, which is this observation's contribution to the total derivative.
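Why the term reduces to the predicted probability for Setosa can be sketched with the chain rule. The notation is assumed here rather than taken from the article: \(p_{\text{Setosa}}\) and \(p_{\text{Virginica}}\) are the softmax outputs, \(\text{raw}_{\text{Setosa}}\) is the raw output that \(b_3\) is added to, and \(L_2\) is the cross-entropy for the second observation.

\[
\begin{aligned}
L_2 &= -\log p_{\text{Virginica}} \\
\frac{\partial L_2}{\partial b_3}
  &= \frac{\partial L_2}{\partial \text{raw}_{\text{Setosa}}}
     \cdot \frac{\partial \text{raw}_{\text{Setosa}}}{\partial b_3}
   = p_{\text{Setosa}} \cdot 1
   = p_{\text{Setosa}}
\end{aligned}
\]

The same reasoning applies whenever the observed species is not Setosa.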

3️⃣ Third Observation

The third observation belongs to the species Versicolor, so the derivative term is again:

Derivative term (third obs)

For petal width = 0.50 and sepal width = 0.37, the predicted probability for Setosa is:

Predicted probability (third obs)

The result is again 0.04.

📐 Total Derivative

Adding the three contributions:

\[
\begin{aligned}
\frac{\partial L}{\partial b_3} &= (0.15 - 1) \;+\; 0.04 \;+\; 0.04 \\
&= -0.77 .
\end{aligned}
\]

This value represents the slope of the tangent line to the loss curve at \(b_3 = -2\).
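The same sum can be reproduced in a few lines of Python. This is a minimal sketch that hard-codes the predicted Setosa probabilities quoted in this article (0.15, 0.04 and 0.04); the variable names are illustrative.

```python
# Predicted probability for Setosa, one entry per observation (values from the article).
predicted_setosa = [0.15, 0.04, 0.04]
# 1 if the observed species is Setosa, otherwise 0.
observed_setosa = [1, 0, 0]

# Each observation contributes (predicted - observed) to the derivative w.r.t. b3.
contributions = [p - y for p, y in zip(predicted_setosa, observed_setosa)]
total_derivative = sum(contributions)

print([round(c, 2) for c in contributions])
print(round(total_derivative, 2))
```

Only the first observation has a negative contribution, because it is the only one whose observed species is Setosa.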

🔄 Gradient‑Descent Update

We now plug this slope into the gradient‑descent update rule:

\[ b_3^{\text{new}} = b_3^{\text{old}} - \alpha \frac{\partial L}{\partial b_3} \]

If we set the learning rate \(\alpha = 1\), then

\[ b_3^{\text{new}} = -2 - ( -0.77 ) = -2 + 0.77 = -1.23 . \]
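As a sketch, the update rule fits in one small function. The function name and its parameters are hypothetical; the numbers are the ones used above.

```python
def gradient_descent_step(old_value, derivative, learning_rate):
    """Return the parameter value after one gradient-descent step."""
    step_size = learning_rate * derivative
    return old_value - step_size

b3_old = -2.0
slope = -0.77        # total derivative at b3 = -2
learning_rate = 1.0  # alpha

b3_new = gradient_descent_step(b3_old, slope, learning_rate)
print(round(b3_new, 2))
```

Because the slope is negative, subtracting it increases the bias, moving it toward the bottom of the loss curve.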

The update step is visualised as:

Gradient‑descent update

We repeat this process, using the updated value of \(b_3\) each time, until one of the stopping criteria is met (predictions stop improving, a maximum number of steps is reached, etc.).

In this example the predictions stop improving when \(b_3 = -0.03\).
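The repeated procedure can be sketched as a loop with two stopping criteria. Everything specific here is an assumption: the function names, the step-size threshold, and the toy derivative, which uses made-up raw outputs rather than the article's network. It will therefore not reproduce the value -0.03.

```python
import math

def run_gradient_descent(derivative_fn, start, learning_rate=1.0,
                         min_step_size=1e-4, max_steps=1000):
    """Repeat gradient-descent steps until the step is tiny or steps run out."""
    value = start
    steps_taken = 0
    for _ in range(max_steps):
        step_size = learning_rate * derivative_fn(value)
        value -= step_size
        steps_taken += 1
        if abs(step_size) < min_step_size:  # value has effectively stopped changing
            break
    return value, steps_taken

def toy_derivative(b3):
    """Sum of (predicted - observed) Setosa terms for three made-up observations."""
    # Each tuple: (raw Setosa output without bias,
    #              raw outputs for the other two species,
    #              1 if the observed species is Setosa else 0)
    observations = [
        (1.0, (0.5, 0.2), 1),
        (-1.0, (0.3, 2.0), 0),
        (-0.5, (1.5, 0.4), 0),
    ]
    total = 0.0
    for raw_setosa, raw_others, observed in observations:
        exp_setosa = math.exp(raw_setosa + b3)
        predicted = exp_setosa / (exp_setosa + sum(math.exp(r) for r in raw_others))
        total += predicted - observed
    return total

b3_final, steps_taken = run_gradient_descent(toy_derivative, start=-2.0)
print(f"b3 = {b3_final:.3f} after {steps_taken} steps")
```

The step-size threshold stands in for "predictions stop improving": once the slope is close to zero, further updates barely change the bias.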
