Understanding AI from First Principles: Multi-Layer Perceptrons and the Hidden Layer Breakthrough
Source: Dev.to
“The perceptron has many limitations… the most serious is its inability to learn even the simplest nonlinear functions.” – Marvin Minsky
The Problem That Stumped AI
In my last post I mentioned that a perceptron can learn AND, OR, and NAND gates perfectly.
But there was one simple logic gate it could never learn, no matter how long you trained it:
XOR (exclusive‑or)
XOR Truth Table
┌─────────┬─────────┬────────┐
│ Input 1 │ Input 2 │ Output │
├─────────┼─────────┼────────┤
│ 0 │ 0 │ 0 │
│ 0 │ 1 │ 1 │
│ 1 │ 0 │ 1 │
│ 1 │ 1 │ 0 │
└─────────┴─────────┴────────┘
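The table above is just Python's built-in bitwise XOR operator (`^`) in action, which makes it easy to check:

```python
# Enumerate the XOR truth table with Python's bitwise XOR operator
for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"{x1} XOR {x2} = {x1 ^ x2}")
```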
When Marvin Minsky and Seymour Papert published Perceptrons (1969), they proved mathematically that single‑layer perceptrons cannot solve XOR.
The result helped trigger the first AI winter: funding dried up, research stalled, and neural networks were largely abandoned for over a decade.
Why XOR Is Special
The Geometry of Impossibility
A perceptron draws one straight line to separate classes.
For XOR you need the output to be 1 when the inputs differ and 0 when they are the same:
Input 2
↑
1 │ [1] [0]
│
0 │ [0] [1]
└──────────────→ Input 1
0 1
[1] = Output 1 (red squares)   [0] = Output 0 (blue circles)
Try drawing a single straight line that separates the red squares from the blue circles – you can’t. The pattern is diagonal; you’d need two lines or a curve.
This is what “not linearly separable” means.
For AND and OR, all the 1’s lie on one side of a line and all the 0’s on the other.
For XOR the classes are interleaved, so a linear classifier is mathematically impossible.
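You can watch this impossibility play out in code. Below is a quick sketch (my own toy example, not the repo's implementation) of the classic perceptron learning rule applied to XOR. Because no single line separates the classes, the weights can never settle on a solution that gets all four points right:

```python
# Perceptron learning rule applied to XOR: doomed to fail.
# A linear classifier can be correct on at most 3 of the 4 XOR points.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]  # XOR targets

w1 = w2 = b = 0.0
lr = 0.1

def predict(x1, x2):
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

for epoch in range(1000):
    for (x1, x2), target in zip(X, y):
        error = target - predict(x1, x2)
        w1 += lr * error * x1
        w2 += lr * error * x2
        b += lr * error

correct = sum(predict(x1, x2) == t for (x1, x2), t in zip(X, y))
print(f"correct after training: {correct}/4")  # never reaches 4/4
```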
The Breakthrough: Hidden Layers
When I was a kid, single‑digit addition was trivial:
3 + 5 = 8
But multi‑digit addition confused me:
27 + 15 → 2+1 = 3, 7+5 = 12 → 312 (wrong!)
I was treating the two columns as independent single‑digit problems.
The missing piece was the carry:
7 + 5 = 12 → write 2, carry 1 to the tens column.
The carry is an intermediate, non‑linear transformation that changes the next step.
That’s exactly what a hidden layer does in a neural network.
- A single‑layer perceptron is like single‑digit addition – inputs go straight to output, no transformation.
- Stacking more linear layers still yields a single straight line; no new capability appears.
- Adding a non‑linear activation (sigmoid, ReLU, etc.) introduces the “carry” – it reshapes the space, making XOR solvable.
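The second bullet above can be verified in a couple of lines: without a non‑linearity between them, two weight matrices multiply out to a single weight matrix, so the "deeper" network is still one straight line. A minimal sketch with NumPy (arbitrary example values):

```python
import numpy as np

# Two linear layers with no activation collapse into one linear layer:
# W2 @ (W1 @ x) == (W2 @ W1) @ x  -- still a single linear map.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((2, 2))  # "hidden" layer weights
W2 = rng.standard_normal((1, 2))  # output layer weights
x = np.array([0.7, -1.3])

two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True: no new expressive power
```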
Solving XOR: The “Aha!” Moment
A 2‑2‑1 network (2 inputs, 2 hidden neurons, 1 output) can learn XOR.
┌──────────────────────────────────────┐
│ Hidden Neuron 1: learns OR pattern │
│ (fires when x₁ OR x₂ = 1) │
│ │
│ Hidden Neuron 2: learns AND pattern │
│ (fires when x₁ AND x₂ = 1) │
│ │
│ Output neuron: combines them │
│ (OR but NOT AND = XOR) │
└──────────────────────────────────────┘
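The diagram above translates directly into a few lines of Python. This is a minimal sketch with hand‑picked weights of my own choosing (not the learned values from the repo's mlp.py): hidden neuron 1 computes OR, hidden neuron 2 computes AND, and the output fires when OR is on but AND is off, which is exactly XOR.

```python
def step(z):
    """Step activation: the non-linearity that makes this work."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)       # fires when x1 OR x2 = 1
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)      # fires when x1 AND x2 = 1
    return step(1.0 * h_or - 2.0 * h_and - 0.5)  # OR but NOT AND = XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"xor_net({x1}, {x2}) = {xor_net(x1, x2)}")  # prints 0, 1, 1, 0
```

Note how the hidden layer re-represents each input pair as (OR, AND); in that transformed space a single line separates the classes.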
When I first ran the code and saw XOR work, I realized the hidden layer isn’t just extra complexity – it transforms the problem into something linearly separable.
Note: The weights and biases in this solution are hand‑crafted for the XOR problem, not learned.
Interactive Playground
Run the interactive playground to see the curved decision boundary in action. Adjust the weight sliders to watch the boundary evolve from weak to strong, and compare it with a perceptron’s straight‑line attempt.
Repository: perceptrons-to-transformers – 02-xor-problem
02-multi-layer-perceptron/
│
├─ mlp.py # Clean MLP implementation
└─ mlp_playground.py # Streamlit app (interactive visualisation)
The playground lets you:
- Visualise the curved decision boundary that solves XOR.
- Adjust weights in real‑time and observe the boundary shift.
- View the full network architecture with all weights labelled.
- Compare a perceptron’s straight line vs. the MLP’s curve.
What This Unlocked
Solving XOR now seems trivial, but it was the breakthrough that unlocked everything.
The real insight was that hidden layers enable non‑linear thinking.
In the 1980s, David Rumelhart, Geoffrey Hinton, and Ronald Williams showed that multi‑layer networks could be trained with back‑propagation. Suddenly, problems once thought impossible became solvable, and the AI winter began to thaw.
Series: From Perceptrons to Transformers – Part 2 of 18
What We’ve Learned So Far
- Perceptrons learned to draw lines (linear boundaries).
- MLPs learned to draw curves (non‑linear boundaries).
- Deep networks learned hierarchies (edges → shapes → objects → concepts).
Today’s neural networks—whether image classifiers or large language models like GPT‑4—all follow the same principle: stack layers with non‑linear activations to transform data into increasingly meaningful representations.
All of this stems from adding that first hidden layer.
What’s Next?
We can now build networks that solve XOR. But there’s one crucial question: How do we learn the weights?
The XOR network shown earlier uses hand‑crafted weights—values I set manually. For real‑world problems with thousands of inputs and millions of weights, manual tuning is impossible.
The algorithm that makes learning feasible is backpropagation. It allows networks to learn from their mistakes and gradually improve.
In the next post we’ll dive into backpropagation—the algorithm that ties everything together. It involves calculus, but I promise to make it intuitive.
References
- Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
- Nielsen, M. (2015). Neural Networks and Deep Learning. Determination Press.
Tags
#MachineLearning #AI #DeepLearning #NeuralNetworks #MLP