Understanding AI from First Principles: Multi-Layer Perceptrons and the Hidden Layer Breakthrough
Source: Dev.to
“The perceptron has many limitations… the most serious is its inability to learn even the simplest nonlinear functions.” – Marvin Minsky
The Problem That Stumped AI
In my last post I mentioned that a perceptron can learn AND, OR, and NAND gates perfectly.
But there was one simple logic gate it could never learn, no matter how long you trained it:
XOR (exclusive‑or)
XOR Truth Table
┌─────────┬─────────┬────────┐
│ Input 1 │ Input 2 │ Output │
├─────────┼─────────┼────────┤
│ 0 │ 0 │ 0 │
│ 0 │ 1 │ 1 │
│ 1 │ 0 │ 1 │
│ 1 │ 1 │ 0 │
└─────────┴─────────┴────────┘
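The table above is just Python's built-in bitwise XOR operator (`^`) in action, which makes it easy to check:

```python
# Enumerate the XOR truth table with Python's bitwise XOR operator
for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"{x1} XOR {x2} = {x1 ^ x2}")
```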
When Marvin Minsky and Seymour Papert published Perceptrons (1969), they proved mathematically that single‑layer perceptrons cannot solve XOR.
The result helped trigger the first AI winter: funding dried up, research stalled, and neural networks were largely abandoned for over a decade.
Why XOR Is Special
The Geometry of Impossibility
A perceptron draws one straight line to separate classes.
For XOR you need the output to be 1 when the inputs differ and 0 when they are the same:
Input 2
↑
1 │ [1] [0]
│
0 │ [0] [1]
└──────────────→ Input 1
0 1
[1] = Output 1 (red squares)   [0] = Output 0 (blue circles)
Try drawing a single straight line that separates the red squares from the blue circles – you can’t. The pattern is diagonal; you’d need two lines or a curve.
This is what “not linearly separable” means.
For AND and OR, all the 1’s lie on one side of a line and all the 0’s on the other.
For XOR the classes are interleaved, so a linear classifier is mathematically impossible.
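You can watch this impossibility play out in code. Below is a quick sketch (my own toy example, not the repo's implementation) of the classic perceptron learning rule applied to XOR. Because no single line separates the classes, the weights can never settle on a solution that gets all four points right:

```python
# Perceptron learning rule applied to XOR: doomed to fail.
# A linear classifier can be correct on at most 3 of the 4 XOR points.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]  # XOR targets

w1 = w2 = b = 0.0
lr = 0.1

def predict(x1, x2):
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

for epoch in range(1000):
    for (x1, x2), target in zip(X, y):
        error = target - predict(x1, x2)
        w1 += lr * error * x1
        w2 += lr * error * x2
        b += lr * error

correct = sum(predict(x1, x2) == t for (x1, x2), t in zip(X, y))
print(f"correct after training: {correct}/4")  # never reaches 4/4
```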
The Breakthrough: Hidden Layers
When I was a kid, single‑digit addition was trivial:
3 + 5 = 8
But multi‑digit addition confused me:
27 + 15 → 2+1 = 3, 7+5 = 12 → 312 (wrong!)
I was treating the two columns as independent single‑digit problems.
The missing piece was the carry:
7 + 5 = 12 → write 2, carry 1 to the tens column.
The carry is an intermediate, non‑linear transformation that changes the next step.
That’s exactly what a hidden layer does in a neural network.
- A single‑layer perceptron is like single‑digit addition – inputs go straight to output, no transformation.
- Stacking more linear layers still yields a single straight line; no new capability appears.
- Adding a non‑linear activation (sigmoid, ReLU, etc.) introduces the “carry” – it reshapes the space, making XOR solvable.
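The second bullet above can be verified in a couple of lines: without a non‑linearity between them, two weight matrices multiply out to a single weight matrix, so the "deeper" network is still one straight line. A minimal sketch with NumPy (arbitrary example values):

```python
import numpy as np

# Two linear layers with no activation collapse into one linear layer:
# W2 @ (W1 @ x) == (W2 @ W1) @ x  -- still a single linear map.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((2, 2))  # "hidden" layer weights
W2 = rng.standard_normal((1, 2))  # output layer weights
x = np.array([0.7, -1.3])

two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True: no new expressive power
```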
Solving XOR: The “Aha!” Moment
A 2‑2‑1 network (2 inputs, 2 hidden neurons, 1 output) can learn XOR.
┌──────────────────────────────────────┐
│ Hidden Neuron 1: learns OR pattern │
│ (fires when x₁ OR x₂ = 1) │
│ │
│ Hidden Neuron 2: learns AND pattern │
│ (fires when x₁ AND x₂ = 1) │
│ │
│ Output neuron: combines them │
│ (OR but NOT AND = XOR) │
└──────────────────────────────────────┘
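The diagram above translates directly into a few lines of Python. This is a minimal sketch with hand‑picked weights of my own choosing (not the learned values from the repo's mlp.py): hidden neuron 1 computes OR, hidden neuron 2 computes AND, and the output fires when OR is on but AND is off, which is exactly XOR.

```python
def step(z):
    """Step activation: the non-linearity that makes this work."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)       # fires when x1 OR x2 = 1
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)      # fires when x1 AND x2 = 1
    return step(1.0 * h_or - 2.0 * h_and - 0.5)  # OR but NOT AND = XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"xor_net({x1}, {x2}) = {xor_net(x1, x2)}")  # prints 0, 1, 1, 0
```

Note how the hidden layer re-represents each input pair as (OR, AND); in that transformed space a single line separates the classes.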
When I first ran the code and saw XOR work, I realized the hidden layer isn’t just extra complexity – it transforms the problem into something linearly separable.
Note: The weights and biases in this solution are hand‑crafted for the XOR problem, not learned.
Interactive Playground
Run the interactive playground to see the curved decision boundary in action. Adjust the weight sliders to watch the boundary evolve from weak to strong, and compare it with a perceptron’s straight‑line attempt.
Repository: perceptrons-to-transformers – 02-xor-problem
02-multi-layer-perceptron/
│
├─ mlp.py # Clean MLP implementation
└─ mlp_playground.py # Streamlit app (interactive visualisation)
The playground lets you:
- Visualise the curved decision boundary that solves XOR.
- Adjust weights in real‑time and observe the boundary shift.
- View the full network architecture with all weights labelled.
- Compare a perceptron’s straight line vs. the MLP’s curve.
What This Unlocked
Solving XOR now seems trivial, but it was the breakthrough that unlocked everything.
The real insight was that hidden layers enable non‑linear thinking.
In the 1980s, David Rumelhart, Geoffrey Hinton, and Ronald Williams showed that multi‑layer networks could be trained with back‑propagation. Suddenly, problems once thought impossible became solvable, and the AI winter began to thaw.
Series: From Perceptrons to Transformers – Part 2 of 18
What We’ve Learned So Far
- Perceptrons learned to draw lines (linear boundaries).
- MLPs learned to draw curves (non‑linear boundaries).
- Deep networks learned hierarchies (edges → shapes → objects → concepts).
Today’s neural networks—whether image classifiers or large language models like GPT‑4—all follow the same principle: stack layers with non‑linear activations to transform data into increasingly meaningful representations.
All of this stems from adding that first hidden layer.
What’s Next?
We can now build networks that solve XOR. But there’s one crucial question: How do we learn the weights?
The XOR network shown earlier uses hand‑crafted weights—values I set manually. For real‑world problems with thousands of inputs and millions of weights, manual tuning is impossible.
The algorithm that makes learning feasible is backpropagation. It allows networks to learn from their mistakes and gradually improve.
In the next post we’ll dive into backpropagation—the algorithm that ties everything together. It involves calculus, but I promise to make it intuitive.
References
- Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
- Nielsen, M. (2015). Neural Networks and Deep Learning. Determination Press.
Tags
#MachineLearning #AI #DeepLearning #NeuralNetworks #MLP