Cross Entropy Derivatives, Part 2: Setting Up the Derivative with Respect to a Bias
Introduction
In the previous article we reviewed the key ideas needed to work with derivatives of cross‑entropy. In this article we set up the derivative step‑by‑step.
Predicted probability for Setosa
When we plug a predicted probability into the cross‑entropy equation, the form of the equation depends on the observed species. Because the observed species in this example is Setosa, the probability we plug in is the predicted probability for Setosa.
Using the softmax function, the predicted probability for Setosa is
\[ p_{\text{Setosa}} = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \]
where \(z_1, z_2, z_3\) are the raw output values (logits) for Setosa, Versicolor, and Virginica, respectively.
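To make this concrete, here is a minimal NumPy sketch of the softmax computation. The logit values are made up for illustration:

```python
import numpy as np

def softmax(z):
    """Turn raw output values (logits) into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([1.43, -0.40, 0.23])  # hypothetical logits: Setosa, Versicolor, Virginica
p = softmax(z)
print(p)        # p[0] is the predicted probability for Setosa
print(p.sum())  # 1.0
```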
Substituting this into the cross‑entropy loss gives
\[ L = -\log\!\left(\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) \]
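With the same hypothetical logits as above, the loss for an observed Setosa is just the negative log of the Setosa probability:

```python
import numpy as np

z = np.array([1.43, -0.40, 0.23])          # hypothetical logits
p_setosa = np.exp(z[0]) / np.exp(z).sum()  # softmax probability for Setosa
loss = -np.log(p_setosa)                   # cross-entropy when observed = Setosa
print(loss)
```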
If the observed species were Virginica, the softmax equation for Virginica would give the corresponding predicted probability
\[ p_{\text{Virginica}} = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}} \]
Each case leads to a slightly different loss expression, and consequently to different derivatives of the cross‑entropy with respect to the bias term \(b_3\).
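A short sketch makes the case split explicit: the same logits produce a different loss depending on which species was observed. The `cross_entropy` helper and the logit values are hypothetical:

```python
import numpy as np

def cross_entropy(z, observed):
    """-log of the predicted probability assigned to the observed class."""
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[observed])

z = np.array([1.43, -0.40, 0.23])  # hypothetical logits
for i, species in enumerate(["Setosa", "Versicolor", "Virginica"]):
    print(species, cross_entropy(z, i))
```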
Summary of derivatives
The derivatives of the cross‑entropy loss with respect to the bias \(b_3\) can be summarized as follows:
\[
\frac{\partial L}{\partial b_3}=
\begin{cases}
\dfrac{\partial L}{\partial p_{\text{Setosa}}} \cdot \dfrac{\partial p_{\text{Setosa}}}{\partial z_1} \cdot \dfrac{\partial z_1}{\partial b_3}, & \text{if observed = Setosa}\\[10pt]
\dfrac{\partial L}{\partial p_{\text{Versicolor}}} \cdot \dfrac{\partial p_{\text{Versicolor}}}{\partial z_1} \cdot \dfrac{\partial z_1}{\partial b_3}, & \text{if observed = Versicolor}\\[10pt]
\dfrac{\partial L}{\partial p_{\text{Virginica}}} \cdot \dfrac{\partial p_{\text{Virginica}}}{\partial z_1} \cdot \dfrac{\partial z_1}{\partial b_3}, & \text{if observed = Virginica}
\end{cases}
\]
In every case the chain passes through \(z_1\), because \(b_3\) enters the network only through the raw output for Setosa (as explained below).
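Before deriving anything by hand, we can sanity‑check this case split with a central‑difference approximation of \(\partial L / \partial b_3\) for each possible observed species. The raw output values and the value of \(b_3\) below are hypothetical; the only structural assumption, taken from the setup in this article, is that \(b_3\) is added to the raw output for Setosa:

```python
import numpy as np

RAW = np.array([0.93, -0.40, 0.23])  # hypothetical raw outputs before adding b3

def loss(b3, observed):
    """Cross-entropy where b3 is added to the Setosa raw output (z1)."""
    z = RAW.copy()
    z[0] += b3
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[observed])

b3, h = 0.5, 1e-6
for i, species in enumerate(["Setosa", "Versicolor", "Virginica"]):
    grad = (loss(b3 + h, i) - loss(b3 - h, i)) / (2 * h)  # central difference
    print(species, grad)
```

The three estimates differ, matching the case analysis above.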
Derivative for Setosa with respect to \(b_3\)
Cross‑entropy loss definition
The cross‑entropy loss for a single example where the true class is Setosa is
\[ L = -\log(p_{\text{Setosa}}) \]
Softmax prediction
\[ p_{\text{Setosa}} = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}} \]
The inputs to the softmax are the raw output values (logits) \(z_1, z_2, z_3\). Only the logit for Setosa, \(z_1\), is directly influenced by the bias \(b_3\), because the network architecture adds \(b_3\) to the raw output for Setosa.
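A tiny sketch of that wiring (the hidden‑layer values and weights are invented for illustration) shows why a unit change in \(b_3\) produces exactly a unit change in \(z_1\):

```python
import numpy as np

hidden = np.array([0.6, -1.2])   # hypothetical hidden-layer outputs
w_setosa = np.array([1.5, 0.8])  # hypothetical weights into the Setosa node
b3 = 0.35

z1 = w_setosa @ hidden + b3      # b3 enters z1 additively
delta = 0.1
z1_shifted = w_setosa @ hidden + (b3 + delta)
print(z1_shifted - z1)           # prints delta (0.1), i.e. dz1/db3 = 1
```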
Applying the chain rule
To optimize \(b_3\) using gradient descent we need
\[ \frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial p_{\text{Setosa}}} \cdot \frac{\partial p_{\text{Setosa}}}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_3} \]
- \(\displaystyle \frac{\partial L}{\partial p_{\text{Setosa}}} = -\frac{1}{p_{\text{Setosa}}}\)
- \(\displaystyle \frac{\partial p_{\text{Setosa}}}{\partial z_1} = p_{\text{Setosa}}\,(1-p_{\text{Setosa}})\) (softmax derivative)
- \(\displaystyle \frac{\partial z_1}{\partial b_3} = 1\) if \(b_3\) is added directly to \(z_1\); otherwise it follows the network's wiring.
Multiplying these terms yields the gradient of the loss with respect to \(b_3\) for a Setosa observation.
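As a quick check, here is that product for the same hypothetical logits used earlier. Multiplying the three factors collapses to \(p_{\text{Setosa}} - 1\), the simplification the next article works through in detail, and the code confirms it numerically:

```python
import numpy as np

z = np.array([1.43, -0.40, 0.23])    # hypothetical logits (z[0] already includes b3)
p1 = np.exp(z[0]) / np.exp(z).sum()  # p_Setosa

dL_dp   = -1.0 / p1                  # dL/dp_Setosa
dp_dz1  = p1 * (1.0 - p1)            # softmax derivative dp_Setosa/dz1
dz1_db3 = 1.0                        # b3 is added directly to z1

grad = dL_dp * dp_dz1 * dz1_db3
print(grad, p1 - 1.0)                # both print the same value
```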
Next steps
In the next article we will compute each of these terms explicitly and show how they combine to give the final gradient used in gradient‑descent updates.