How AI Learns: Gradient Descent Explained Through a Midnight Smoky Jollof Adventure

Published: December 16, 2025 at 02:59 AM EST
6 min read
Source: Dev.to

Many Aspects of the Modern World Are Powered by Artificial Intelligence

Artificial intelligence (AI) now drives countless facets of our lives, accelerating human civilization dramatically.

  • Faster disease detection
  • Automated decision‑making
  • Breakthroughs in medical imaging
  • Quiet, rapid adoption of AI in law firms and the judicial system
  • Reshaping the future of agriculture

Despite this tremendous progress, many people still don’t understand where AI gets its brilliance from. AI’s ability to identify errors and iteratively improve is indeed amazing.

This article will gently hold your hand and explain the true super‑power behind AI and machine learning: a simple mathematical algorithm called gradient descent.

What Is Gradient Descent?

Gradient descent is a general‑purpose mathematical algorithm that can find good solutions to a very wide range of problems. In machine learning, it works by iteratively updating a model's parameters to minimize a loss (or cost) function.

In very simple terms, gradient descent helps AI figure out how wrong it is and how to become less wrong.

To understand gradient descent, its complexities, and its purpose, let’s look “under the hood” and reason like an AI model.

Midnight Smoky Jollof Adventure

Imagine you’re at a Thanksgiving dinner. Your mom has cooked a special, taste‑bud‑pleasing Nigerian jollof rice. The night is perfect, you reconnect with your siblings, and then everyone goes to bed.

In the middle of the night, your brain and tongue keep craving more. The smoky jollof rice is so tantalising that you can smell it several feet away.

You resist, but eventually you get up and head toward the kitchen—in total darkness. You don’t want to get caught, nor do you want to trip over anything.

Mapping the House to a Graph

Think of the house floor as graph paper:

| Axis | Meaning |
| --- | --- |
| X‑axis | Left‑right position |
| Y‑axis | Forward‑backward position |

Your location is a coordinate (x, y).
You start at the point (1, 1).

The Loss Function

We need a way to measure how close we are to the jollof rice (the kitchen).
Let the kitchen be at (3, 4).

A simple distance formula is

[ \text{Distance} = \sqrt{(x-3)^2 + (y-4)^2} ]

For convenience we’ll use the squared distance as our loss:

[ \boxed{L(x, y) = (x-3)^2 + (y-4)^2} ]

The loss tells us how “wrong” we are: the higher the loss, the farther we are from the kitchen. Our goal is to reduce the loss as much as possible.

Starting Loss

At (1, 1):

[ L(1,1) = (1-3)^2 + (1-4)^2 = (-2)^2 + (-3)^2 = 4 + 9 = 13 ]

We’re quite far from the kitchen.
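In code, this loss is a two‑line function. Here is a minimal Python sketch; the kitchen coordinates (3, 4) come straight from the story:

```python
def loss(x, y):
    """Squared distance from our position (x, y) to the kitchen at (3, 4)."""
    return (x - 3) ** 2 + (y - 4) ** 2

print(loss(1, 1))  # 13 -- matches the starting loss above
```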

Testing Directions (Manual Tweaks)

| New Position | New Loss | Change |
| --- | --- | --- |
| (1.001, 1) | (1.001-3)^2 + (1-4)^2 = 3.996 + 9 = 12.996 | -0.004 |
| (1, 1.001) | (1-3)^2 + (1.001-4)^2 = 4 + 8.994 = 12.994 | -0.006 |

Both tiny moves reduce the loss, showing we’re heading in the right direction.
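We can mimic this trial‑and‑error probing numerically, reusing the `loss` function sketched above (the step size 0.001 matches the table):

```python
base = loss(1, 1)

# Nudge each coordinate by 0.001 and watch how the loss changes.
print(loss(1.001, 1) - base)  # ~ -0.004: stepping right lowers the loss
print(loss(1, 1.001) - base)  # ~ -0.006: stepping forward lowers it more
```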

The Mathematical Shortcut (Derivatives)

Instead of testing each direction, we compute the gradient—the vector of partial derivatives.

For

[ L(x, y) = (x-3)^2 + (y-4)^2 ]

the partial derivatives are:

[ \frac{\partial L}{\partial x} = 2(x-3) \qquad \frac{\partial L}{\partial y} = 2(y-4) ]

At (x, y) = (1, 1):

[ \frac{\partial L}{\partial x}\bigg|_{(1,1)} = 2(1-3) = -4 \qquad \frac{\partial L}{\partial y}\bigg|_{(1,1)} = 2(1-4) = -6 ]

Interpretation

  • For every tiny step right (+\Delta x), the loss decreases by roughly (4\Delta x).
  • For every tiny step forward (+\Delta y), the loss decreases by roughly (6\Delta y).

The Gradient Vector

[ \mathbf{g} = \begin{bmatrix} -4 \\ -6 \end{bmatrix} ]
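The same shortcut in code, using the partial derivatives we just derived (a sketch building on the `loss` function above):

```python
def gradient(x, y):
    """Gradient of the loss: the vector of partial derivatives (dL/dx, dL/dy)."""
    return 2 * (x - 3), 2 * (y - 4)

print(gradient(1, 1))  # (-4, -6) -- the gradient vector g
```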

Learning Rate

We need a learning rate (\eta) to control step size. Choose (\eta = 0.1) (not too small, not too large).

Update Rule

[ \mathbf{p}_{\text{new}} = \mathbf{p}_{\text{old}} - \eta \, \mathbf{g} ]

Applying it:

[ \begin{aligned} x_{\text{new}} &= 1 - 0.1(-4) = 1 + 0.4 = 1.4 \\ y_{\text{new}} &= 1 - 0.1(-6) = 1 + 0.6 = 1.6 \end{aligned} ]

So we move from (1, 1) to (1.4, 1.6).

New Loss

[ L(1.4, 1.6) = (1.4-3)^2 + (1.6-4)^2 = (-1.6)^2 + (-2.4)^2 = 2.56 + 5.76 = 8.32 ]

Loss dropped from 13 → 8.32 – a big improvement!
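Here is that single update step in Python, continuing the sketch above (`eta` is the learning rate of 0.1 we chose):

```python
eta = 0.1  # learning rate

x, y = 1.0, 1.0
gx, gy = gradient(x, y)            # (-4, -6)
x, y = x - eta * gx, y - eta * gy  # step *against* the gradient

print(x, y)        # ~ (1.4, 1.6)
print(loss(x, y))  # ~ 8.32 -- down from 13
```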

Next Iterations

Second Update

At (1.4, 1.6):

[ \frac{\partial L}{\partial x} = 2(1.4-3) = -3.2 \qquad \frac{\partial L}{\partial y} = 2(1.6-4) = -4.8 ]

Gradient:

[ \mathbf{g} = \begin{bmatrix} -3.2 \\ -4.8 \end{bmatrix} ]

[ \begin{aligned} x_{\text{new}} &= 1.4 - 0.1(-3.2) = 1.4 + 0.32 = 1.72 \\ y_{\text{new}} &= 1.6 - 0.1(-4.8) = 1.6 + 0.48 = 2.08 \end{aligned} ]

Loss at (1.72, 2.08):

[ L = (-1.28)^2 + (-1.92)^2 = 1.6384 + 3.6864 \approx 5.3248 ]

Loss fell from 8.32 → 5.32. You’re now at the kitchen door!

Because this loss function is convex, repeated gradient‑descent steps with a suitably small learning rate are guaranteed to converge to the global optimum (the lowest possible loss). When the gradient becomes zero, you’ve reached a critical point, which in this case is the minimum.
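Looping the same update shows this convergence in practice; in this sketch, after 50 steps we are essentially standing in the kitchen:

```python
x, y = 1.0, 1.0
for _ in range(50):
    gx, gy = gradient(x, y)
    x, y = x - eta * gx, y - eta * gy

print(round(x, 3), round(y, 3))  # ~ (3.0, 4.0): the kitchen
print(loss(x, y))                # ~ 0 -- the gradient here is (almost) zero
```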

In Real Machine Learning

  • Parameters: Modern models have millions or billions of weights, not just two.
  • Loss Functions: Common choices include Cross‑Entropy, Mean Squared Error, and many others, rather than simple squared distance.
  • Optimization: Gradient descent (or its many variants) is the workhorse that updates all those parameters to minimize the chosen loss.

In essence, gradient descent measures the local gradient of the error function with respect to the parameter vector (\theta) and moves in the direction of steepest descent. When the gradient is zero, you have reached a critical point—ideally a minimum.

Now you have a clear, intuitive picture of the “super‑power” behind AI: a humble mathematical algorithm that, step by step, guides massive models toward better performance.

Gradient Descent and Its Variants

Real models must navigate a complex, multi‑dimensional loss landscape with hills, valleys, and plateaus. But the core algorithm, the relentless optimization engine, remains gradient descent and its smarter variants (Adam, RMSProp).

This is exactly how gradient descent works and how artificial intelligence can learn patterns and improve its predictions.

Types of Gradient Descent

Batch Gradient Descent

The gradient is computed over all training examples, and then a single update step is taken.

[ \theta_{\text{new}} = \theta_{\text{old}} - \eta \cdot \frac{1}{m}\sum_{i=1}^{m}\nabla L(\theta, x_i, y_i) ]

  • (m) = total number of training examples
  • (\eta) = learning rate
  • (\nabla L) = gradient of the loss for a given example
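A sketch of one batch update in NumPy. Note that `grad_loss(theta, x_i, y_i)` is a hypothetical per‑example gradient function assumed for illustration; it is not defined in this article:

```python
import numpy as np

def batch_step(theta, xs, ys, grad_loss, eta=0.1):
    """One batch gradient descent step: average the gradient over all m examples."""
    # grad_loss is a hypothetical function returning the gradient for one example.
    grads = np.array([grad_loss(theta, x_i, y_i) for x_i, y_i in zip(xs, ys)])
    return theta - eta * grads.mean(axis=0)
```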

Stochastic Gradient Descent (SGD)

Instead of using the whole dataset, SGD picks a random instance from the training set at each step and computes the gradient based only on that instance. This makes the algorithm much faster but also noisier. The stochastic noise can help escape local minima, though with a constant learning rate it tends to oscillate around the minimum rather than converge exactly.

For each random example (i):

[ \theta = \theta - \eta \, \nabla L(\theta, x_i, y_i) ]
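A corresponding sketch of one SGD epoch, reusing the hypothetical `grad_loss` from above:

```python
import numpy as np

def sgd_epoch(theta, xs, ys, grad_loss, eta=0.01):
    """One SGD pass: update on a single randomly chosen example at a time."""
    for i in np.random.permutation(len(xs)):
        theta = theta - eta * grad_loss(theta, xs[i], ys[i])
    return theta
```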

Mini‑Batch Gradient Descent

A small batch (e.g., 16 or 32 examples) is used to compute the gradient, then an update is performed. This strikes a sweet spot between the stability of batch gradient descent and the speed of SGD.

For each batch (B) of size (b):

[ \nabla L_{\text{batch}} = \frac{1}{b}\sum_{i \in B}\nabla L(\theta, x_i, y_i) ]

[ \theta = \theta - \eta \, \nabla L_{\text{batch}} ]
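And a sketch of one mini‑batch epoch with the same hypothetical `grad_loss`; the batch size of 32 is just a common default:

```python
import numpy as np

def minibatch_epoch(theta, xs, ys, grad_loss, eta=0.01, b=32):
    """One pass of mini-batch gradient descent with batch size b."""
    order = np.random.permutation(len(xs))  # shuffle, then walk through in batches
    for start in range(0, len(xs), b):
        batch = order[start:start + b]
        grads = np.array([grad_loss(theta, xs[i], ys[i]) for i in batch])
        theta = theta - eta * grads.mean(axis=0)
    return theta
```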

Conclusion

Understanding how gradient descent works is profound. It shows that artificial intelligence—and learning in general—is not about being perfect from the start; it’s about having a reliable method to become less wrong quickly and accurately.

This mirrors human learning: psychologists often recommend reflective or meditative time to reason about mistakes and devise fixes. Gradient descent thus serves as a conceptual bridge between artificial intelligence and the human mind.
