Cross Entropy Derivatives, Part 3: Chain Rule for a Single Output Class

Published: February 3, 2026 at 02:28 PM EST
3 min read
Source: Dev.to

![Cover image for Cross Entropy Derivatives, Part 3: Chain Rule for a Single Output Class](https://media2.dev.to/dynamic/image/width=1000,height=420,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F607e6xwa7p40llfkxevy.png)

In the [previous article](https://dev.to/rijultp/cross-entropy-derivatives-part-2-setting-up-the-derivative-with-respect-to-a-bias-32gh) we prepared a chain‑rule equation to compute the derivative of cross‑entropy with respect to bias **b₃**.  
We will solve that equation step‑by‑step in this article.

---

## 1️⃣ Derivative of the cross‑entropy with respect to the predicted probability for *Setosa*

![Derivative of cross‑entropy w.r.t. predicted probability (Setosa)](https://media2.dev.to/dynamic/image/width=800,height=&fit=scale-down,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg835cy6hdau5b648l71x.png)

We use the familiar formula, where \(y\) is the true label and \(\hat{y}\) is the predicted probability:  

\[
\frac{\partial \, \text{CE}}{\partial \hat{y}} = -\frac{y}{\hat{y}}
\]
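
For reference, this formula follows directly from the definition of cross‑entropy. A short derivation, assuming the natural logarithm and one‑hot labels (the class index \(c\) is notation introduced here, not taken from the article):

\[
\text{CE} = -\sum_{c} y_{c}\,\ln \hat{y}_{c}
\quad\Longrightarrow\quad
\frac{\partial \, \text{CE}}{\partial \hat{y}_{c}} = -\frac{y_{c}}{\hat{y}_{c}}
\]

With one‑hot labels only the observed class has \(y_{c} = 1\), so every other term in the sum vanishes.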

Applying it here, with *Setosa* as the observed class so that \(y_{\text{Setosa}} = 1\), gives  

\[
\frac{\partial \, \text{CE}}{\partial \hat{y}_{\text{Setosa}}}
   = -\frac{1}{\hat{y}_{\text{Setosa}}}
\]

![Result of the derivative w.r.t. predicted probability](https://media2.dev.to/dynamic/image/width=800,height=&fit=scale-down,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flezf06ghq1l1hlim2fi1.png)
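
As a quick sanity check, the result can be compared against a finite‑difference estimate. This is a minimal sketch, assuming the natural logarithm in the cross‑entropy; the function names and the probability `0.57` are made up for illustration.

```python
import math

def cross_entropy(p_observed: float) -> float:
    """Cross-entropy for one sample: -ln of the probability predicted for the observed class."""
    return -math.log(p_observed)

def d_ce_d_p(p_observed: float) -> float:
    """Analytic derivative -y / y_hat with y = 1 (observed class)."""
    return -1.0 / p_observed

p_setosa = 0.57  # illustrative predicted probability, not from the article
h = 1e-6         # step size for the central difference

numeric = (cross_entropy(p_setosa + h) - cross_entropy(p_setosa - h)) / (2 * h)
analytic = d_ce_d_p(p_setosa)

print(f"analytic = {analytic:.6f}, numeric = {numeric:.6f}")
```

The two values should agree to several decimal places; a large gap would point to a sign or logarithm‑base mistake.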

---

## 2️⃣ Derivative of the predicted probability with respect to the raw output for *Setosa*

First, write the soft‑max equation for the predicted probability of *Setosa*:

\[
\hat{y}_{\text{Setosa}} = 
\frac{e^{z_{\text{Setosa}}}}
     {e^{z_{\text{Setosa}}}+e^{z_{\text{Versicolor}}}+e^{z_{\text{Virginica}}}}
\]

Taking the derivative with respect to the raw output \(z_{\text{Setosa}}\) yields

\[
\frac{\partial \hat{y}_{\text{Setosa}}}{\partial z_{\text{Setosa}}}
   = \hat{y}_{\text{Setosa}}\bigl(1-\hat{y}_{\text{Setosa}}\bigr)
\]
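
The intermediate steps can be filled in with the quotient rule. In this sketch, \(S\) is shorthand introduced here for the soft‑max denominator:

\[
\begin{aligned}
S &= e^{z_{\text{Setosa}}}+e^{z_{\text{Versicolor}}}+e^{z_{\text{Virginica}}} \\
\frac{\partial \hat{y}_{\text{Setosa}}}{\partial z_{\text{Setosa}}}
  &= \frac{e^{z_{\text{Setosa}}}\,S - e^{z_{\text{Setosa}}}\,e^{z_{\text{Setosa}}}}{S^{2}} \\
  &= \frac{e^{z_{\text{Setosa}}}}{S}\Bigl(1-\frac{e^{z_{\text{Setosa}}}}{S}\Bigr)
   = \hat{y}_{\text{Setosa}}\bigl(1-\hat{y}_{\text{Setosa}}\bigr)
\end{aligned}
\]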

![Derivative of soft‑max w.r.t. raw output (Setosa)](https://media2.dev.to/dynamic/image/width=800,height=&fit=scale-down,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbb2b4op53jw564yhu94d.png)
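
The soft‑max derivative can also be checked numerically. The sketch below assumes three hypothetical raw outputs (the values are illustrative only) and compares \(\hat{y}(1-\hat{y})\) with a central‑difference estimate.

```python
import math

def softmax(z):
    """Softmax over a list of raw outputs, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs in the order [Setosa, Versicolor, Virginica]
z = [1.43, -0.40, 0.23]

p_setosa = softmax(z)[0]
analytic_dp_dz = p_setosa * (1.0 - p_setosa)

# Nudge only the Setosa raw output and watch the Setosa probability
h = 1e-6
z_plus = [z[0] + h, z[1], z[2]]
z_minus = [z[0] - h, z[1], z[2]]
numeric_dp_dz = (softmax(z_plus)[0] - softmax(z_minus)[0]) / (2 * h)

print(f"analytic = {analytic_dp_dz:.6f}, numeric = {numeric_dp_dz:.6f}")
```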

---

## 3️⃣ Derivative of the raw output for *Setosa* with respect to the bias **b₃**

The raw output for *Setosa* can be expressed as  

\[
z_{\text{Setosa}} = w_{3}^{\top}x + b_{3}
\]

Hence  

\[
\frac{\partial z_{\text{Setosa}}}{\partial b_{3}} = 1
\]

In the figure, the raw output is drawn as a sum of terms, and only the bias term depends on \(b_{3}\):

- The blue bent surface (other weights) is independent of \(b_{3}\) → derivative = 0.  
- The orange bent surface (other biases) is also independent of \(b_{3}\) → derivative = 0.  
- The bias term itself varies linearly with \(b_{3}\) → derivative = 1.

![Derivative of raw output w.r.t. bias b₃](https://media2.dev.to/dynamic/image/width=800,height=&fit=scale-down,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ntjskcssk82qcqisdp7.png)
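
Because the raw output is linear in \(b_{3}\), its slope with respect to \(b_{3}\) is 1 regardless of the weights or inputs. A minimal sketch, using hypothetical weights and inputs chosen only for illustration:

```python
def raw_output_setosa(w3, x, b3):
    """Raw output z = w3 . x + b3, following the expression above."""
    return sum(wi * xi for wi, xi in zip(w3, x)) + b3

# Hypothetical values; none of these come from the article
w3 = [0.8, -1.2]
x = [0.5, 0.3]
b3 = 0.0

h = 1e-3
slope = (raw_output_setosa(w3, x, b3 + h) - raw_output_setosa(w3, x, b3)) / h

print(f"dz/db3 is approximately {slope:.6f}")
```

Changing `w3` or `x` shifts the value of the raw output but leaves this slope unchanged.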

---

## 4️⃣ Putting it all together – Chain rule

\[
\frac{\partial \,\text{CE}}{\partial b_{3}}
   = \frac{\partial \,\text{CE}}{\partial \hat{y}_{\text{Setosa}}}
     \times
     \frac{\partial \hat{y}_{\text{Setosa}}}{\partial z_{\text{Setosa}}}
     \times
     \frac{\partial z_{\text{Setosa}}}{\partial b_{3}}
\]

Substituting the three pieces derived above, the \(\hat{y}_{\text{Setosa}}\) in the denominator cancels the one in the numerator:

\[
\frac{\partial \,\text{CE}}{\partial b_{3}}
   = \Bigl(-\frac{1}{\hat{y}_{\text{Setosa}}}\Bigr)
     \times
     \bigl(\hat{y}_{\text{Setosa}}\bigl(1-\hat{y}_{\text{Setosa}}\bigr)\bigr)
     \times
     1
   = -\bigl(1-\hat{y}_{\text{Setosa}}\bigr)
   = \hat{y}_{\text{Setosa}} - 1
\]

So, when the observed class is *Setosa* (i.e., the true label \(y_{\text{Setosa}} = 1\)), the derivative of the cross‑entropy with respect to \(b_{3}\) is  

\[
\boxed{\displaystyle \frac{\partial \,\text{CE}}{\partial b_{3}} = \hat{y}_{\text{Setosa}} - 1}
\]

This is the gradient of the cross‑entropy loss with respect to the bias term \(b_{3}\) for the *Setosa* class.  
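
The whole chain can be verified end to end by differentiating the loss numerically with respect to \(b_{3}\). This is a sketch under stated assumptions: natural logarithm, *Setosa* as the observed class, and hypothetical raw outputs; `pre_bias_setosa` stands for everything in the *Setosa* raw output except \(b_{3}\) and is a name introduced here.

```python
import math

def softmax(z):
    """Softmax over a list of raw outputs, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def ce_given_b3(b3, pre_bias_setosa, z_versicolor, z_virginica):
    """Cross-entropy when Setosa is the observed class, as a function of b3."""
    z = [pre_bias_setosa + b3, z_versicolor, z_virginica]
    return -math.log(softmax(z)[0])

# Hypothetical values, for illustration only
pre_bias_setosa, z_versicolor, z_virginica = 1.10, -0.40, 0.23
b3 = 0.33

# Analytic gradient from the chain rule: y_hat_setosa - 1
p_setosa = softmax([pre_bias_setosa + b3, z_versicolor, z_virginica])[0]
analytic_grad = p_setosa - 1.0

# Central-difference estimate of dCE/db3
h = 1e-6
numeric_grad = (
    ce_given_b3(b3 + h, pre_bias_setosa, z_versicolor, z_virginica)
    - ce_given_b3(b3 - h, pre_bias_setosa, z_versicolor, z_virginica)
) / (2 * h)

print(f"analytic = {analytic_grad:.6f}, numeric = {numeric_grad:.6f}")
```

Since \(\hat{y}_{\text{Setosa}} < 1\), the gradient is negative, so a gradient‑descent step increases \(b_{3}\) and raises the predicted probability for *Setosa*.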

---

*End of Part 3 – Chain Rule for a Single Output Class.*

In the next article, we will continue by applying the same process for Virginica.

