Cross Entropy Derivatives, Part 2: Setting Up the Derivative with Respect to a Bias
Introduction
In the previous article we reviewed the key ideas needed to work with derivatives of cross‑entropy. In this article we set up the derivative step‑by‑step.
Predicted probability for Setosa
When we plug a predicted probability into the cross‑entropy equation, the form of the equation depends on the observed species. Because the observed species in this example is Setosa, the probability we plug in is the predicted probability for Setosa.
Using the softmax function, the predicted probability for Setosa is
\[ p_{\text{Setosa}} = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \]
where \(z_1, z_2, z_3\) are the raw output values (logits) for Setosa, Versicolor, and Virginica, respectively.
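To make this concrete, here is a minimal NumPy sketch of the softmax computation. The logit values are made up for illustration:

```python
import numpy as np

def softmax(z):
    """Turn raw output values (logits) into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([1.43, -0.40, 0.23])  # hypothetical logits: Setosa, Versicolor, Virginica
p = softmax(z)
print(p)        # p[0] is the predicted probability for Setosa
print(p.sum())  # 1.0
```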
Substituting this into the cross‑entropy loss gives
\[ L = -\log\!\left(\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) \]
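With the same hypothetical logits as above, the loss for an observed Setosa is just the negative log of the Setosa probability:

```python
import numpy as np

z = np.array([1.43, -0.40, 0.23])          # hypothetical logits
p_setosa = np.exp(z[0]) / np.exp(z).sum()  # softmax probability for Setosa
loss = -np.log(p_setosa)                   # cross-entropy when observed = Setosa
print(loss)
```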
If the observed species were Virginica, the softmax equation for Virginica would give the corresponding predicted probability
\[ p_{\text{Virginica}} = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}} \]
Each case leads to a slightly different loss expression, and consequently to different derivatives of the cross‑entropy with respect to the bias term \(b_3\).
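A short sketch makes the case split explicit: the same logits produce a different loss depending on which species was observed. The `cross_entropy` helper and the logit values are hypothetical:

```python
import numpy as np

def cross_entropy(z, observed):
    """-log of the predicted probability assigned to the observed class."""
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[observed])

z = np.array([1.43, -0.40, 0.23])  # hypothetical logits
for i, species in enumerate(["Setosa", "Versicolor", "Virginica"]):
    print(species, cross_entropy(z, i))
```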
Summary of derivatives
The derivatives of the cross‑entropy loss with respect to the bias \(b_3\) can be summarized as follows:
\[
\frac{\partial L}{\partial b_3}=
\begin{cases}
\dfrac{\partial L}{\partial p_{\text{Setosa}}} \cdot \dfrac{\partial p_{\text{Setosa}}}{\partial z_1} \cdot \dfrac{\partial z_1}{\partial b_3}, & \text{if observed = Setosa}\\[10pt]
\dfrac{\partial L}{\partial p_{\text{Versicolor}}} \cdot \dfrac{\partial p_{\text{Versicolor}}}{\partial z_1} \cdot \dfrac{\partial z_1}{\partial b_3}, & \text{if observed = Versicolor}\\[10pt]
\dfrac{\partial L}{\partial p_{\text{Virginica}}} \cdot \dfrac{\partial p_{\text{Virginica}}}{\partial z_1} \cdot \dfrac{\partial z_1}{\partial b_3}, & \text{if observed = Virginica}
\end{cases}
\]
In every case the chain passes through \(z_1\), because \(b_3\) enters the network only through the raw output for Setosa (as explained below).
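Before deriving anything by hand, we can sanity‑check this case split with a central‑difference approximation of \(\partial L / \partial b_3\) for each possible observed species. The raw output values and the value of \(b_3\) below are hypothetical; the only structural assumption, taken from the setup in this article, is that \(b_3\) is added to the raw output for Setosa:

```python
import numpy as np

RAW = np.array([0.93, -0.40, 0.23])  # hypothetical raw outputs before adding b3

def loss(b3, observed):
    """Cross-entropy where b3 is added to the Setosa raw output (z1)."""
    z = RAW.copy()
    z[0] += b3
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[observed])

b3, h = 0.5, 1e-6
for i, species in enumerate(["Setosa", "Versicolor", "Virginica"]):
    grad = (loss(b3 + h, i) - loss(b3 - h, i)) / (2 * h)  # central difference
    print(species, grad)
```

The three estimates differ, matching the case analysis above.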
Derivative for Setosa with respect to \(b_3\)
Cross‑entropy loss definition
The cross‑entropy loss for a single example where the true class is Setosa is
\[ L = -\log(p_{\text{Setosa}}) \]
Softmax prediction
\[ p_{\text{Setosa}} = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \]
The inputs to the softmax are the raw output values (logits) \(z_1, z_2, z_3\). Only the logit for Setosa, \(z_1\), is directly influenced by the bias \(b_3\), because the network architecture adds \(b_3\) to the raw output for Setosa.
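A tiny sketch of that wiring (the hidden‑layer values and weights are invented for illustration) shows why a unit change in \(b_3\) produces exactly a unit change in \(z_1\):

```python
import numpy as np

hidden = np.array([0.6, -1.2])   # hypothetical hidden-layer outputs
w_setosa = np.array([1.5, 0.8])  # hypothetical weights into the Setosa node
b3 = 0.35

z1 = w_setosa @ hidden + b3      # b3 enters z1 additively
delta = 0.1
z1_shifted = w_setosa @ hidden + (b3 + delta)
print(z1_shifted - z1)           # prints delta (0.1), i.e. dz1/db3 = 1
```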
Applying the chain rule
To optimize \(b_3\) using gradient descent we need
\[ \frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial p_{\text{Setosa}}} \cdot \frac{\partial p_{\text{Setosa}}}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_3} \]
- \(\displaystyle \frac{\partial L}{\partial p_{\text{Setosa}}} = -\frac{1}{p_{\text{Setosa}}}\)
- \(\displaystyle \frac{\partial p_{\text{Setosa}}}{\partial z_1} = p_{\text{Setosa}}\,(1-p_{\text{Setosa}})\) (softmax derivative)
- \(\displaystyle \frac{\partial z_1}{\partial b_3} = 1\) if \(b_3\) is added directly to \(z_1\); otherwise it follows the network's wiring.
Multiplying these terms yields the gradient of the loss with respect to \(b_3\) for a Setosa observation.
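As a quick check, here is that product for the same hypothetical logits used earlier. Multiplying the three factors collapses to \(p_{\text{Setosa}} - 1\), the simplification the next article works through in detail, and the code confirms it numerically:

```python
import numpy as np

z = np.array([1.43, -0.40, 0.23])    # hypothetical logits (z[0] already includes b3)
p1 = np.exp(z[0]) / np.exp(z).sum()  # p_Setosa

dL_dp   = -1.0 / p1                  # dL/dp_Setosa
dp_dz1  = p1 * (1.0 - p1)            # softmax derivative dp_Setosa/dz1
dz1_db3 = 1.0                        # b3 is added directly to z1

grad = dL_dp * dp_dz1 * dz1_db3
print(grad, p1 - 1.0)                # both print the same value
```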
Next steps
In the next article we will compute each of these terms explicitly and show how they combine to give the final gradient used in gradient‑descent updates.