We Didn’t Just Train AI on the Internet. We Started Training It on Itself.
Source: Dev.to
Optimizing for Compute
We’ve optimized for compute as if it were the main constraint:
- GPUs
- Clusters
- Parallelism
- Faster training runs
But a less visible constraint is emerging: we are running out of high‑quality human data. Worse, we are replacing it with something fundamentally different: synthetic content generated by the very models we are training.
The Lost Human Internet
Early foundation models benefited from a mostly human internet:
- Stack Overflow answers written under pressure at 2 AM
- Reddit threads full of disagreement and correction
- GitHub repos with half‑documented trade‑offs
- Research papers with actual uncertainty baked in
- Forums where people argued, failed, and refined ideas
This wasn’t “data” in the traditional sense; it was compressed human reasoning under constraint, chaotic in a useful way.
The Rise of Synthetic Content
Fast forward to now. A large and growing portion of the web consists of:
- AI‑written blog posts
- SEO pages generated at scale
- Code snippets rewritten by multiple LLMs
- Summaries of summaries of summaries
- Content optimized for ranking systems, not humans
Individually, none of this looks dangerous. Collectively, it creates something new: a dataset increasingly shaped by model behavior, not human behavior.
The Recursive Training Loop
We are entering a recursive training loop:
Human data → Model training → AI‑generated content → New training data → …
Each cycle slightly reduces:
- Variance
- Originality
- Contradiction density
- “Weird” human edge cases
and increases:
- Pattern repetition
- Stylistic convergence
- Safe, average reasoning
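The narrowing effect of this loop can be sketched with a toy simulation. This is a minimal illustration, not a claim about how any real model is trained: it assumes the "model" is just a Gaussian fitted to its training data, and that each generation trains only on samples from the previous fit. The function name `simulate_recursive_training` and its parameters are hypothetical.

```python
import random
import statistics


def simulate_recursive_training(generations=200, sample_size=20, seed=0):
    """Toy loop: fit a Gaussian to data, sample from the fit, repeat.

    Returns the fitted standard deviation at each generation.
    """
    rng = random.Random(seed)
    # Generation 0: "human" data, assumed here to be standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = []
    for _ in range(generations):
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # maximum-likelihood estimate of spread
        spreads.append(sigma)
        # The next generation sees only the previous model's outputs.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return spreads


if __name__ == "__main__":
    spreads = simulate_recursive_training()
    for gen in (0, 50, 100, 199):
        print(f"generation {gen:3d}: std = {spreads[gen]:.6f}")
```

In this setup nothing ever adds variance back, so small estimation errors compound and the fitted spread drifts toward zero. Mixing some of the original data into every generation would be the toy equivalent of preserving human signal.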
Consequences for Model Scaling
The assumption that more compute means better intelligence ignores distribution collapse, the gradual narrowing of what the training data actually contains. If the dataset slowly shifts toward:
- Repetition
- Templated reasoning
- Averaged explanations
- Low‑information content
then scaling merely yields faster convergence to the same middle‑of‑the‑road answer. The result is not deeper intelligence, just more confident imitation.
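One rough way to put a number on this kind of repetition is a distinct n-gram ratio: the share of word n-grams in a corpus that are unique. The sketch below is an assumption about how one might measure it, not a method from this article; the helper `distinct_n` and both sample corpora are invented for illustration.

```python
def distinct_n(texts, n=2):
    """Share of unique word n-grams across a list of texts (1.0 = no repeats)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Invented examples: two unrelated sentences vs. two templated ones.
varied = [
    "the cache misses because the key includes a timestamp",
    "try pinning the dependency before you rebuild",
]
templated = [
    "it is important to note that caching is important",
    "it is important to note that testing is important",
]

if __name__ == "__main__":
    print("varied:   ", distinct_n(varied))
    print("templated:", distinct_n(templated))
```

A lower ratio means more of the text is repeated phrasing. It is a crude proxy that captures surface repetition only, not whether the underlying reasoning is templated.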
If you’ve used multiple LLMs recently, you’ve probably felt it: they are converging not in capability, but in voice.
- Same structured bullet reasoning
- Same “balanced” tone
- Same careful disclaimers
- Same predictable framing patterns
- Same safe explanatory style
Industry Response
Major AI labs are quietly responding in the same way, by securing human data:
- Licensing publisher archives
- Paying for forum and community data
- Locking down Reddit‑scale conversations
- Building proprietary human datasets
High‑quality human‑generated data has become infrastructure, and infrastructure determines ceilings more than model size.
A Subtle Failure Mode
People often ask, “Will AI become too powerful?” The more realistic, subtler failure mode is:
AI systems becoming increasingly self‑referential, trained on echoes of their own outputs.
When that happens, we lose:
- Edge‑case reasoning
- Novelty in thought
- Contradiction signals
- Messy human intuition
- Unexpected leaps
These ingredients produced the breakthroughs in the first place.
Diverging Internet Layers
We are likely splitting into two internet layers:
- A human layer: expensive, curated, licensed, hard‑to‑replicate
- A synthetic layer: cheap, scalable, increasingly self‑referential
The gap between these layers will define model quality more than parameter count ever will.
Updating the Narrative
We often say, “AI is trained on the internet.” That’s outdated. A more precise version is:
“AI is now being trained on the internet after it has been shaped by earlier versions of AI.”
This single shift changes the entire system dynamics. The internet didn’t just train AI; it gave AI structure, tone, and reasoning patterns. Now AI is feeding back into that same system.
Outlook
We may be entering a phase where intelligence improvement is limited not by compute, but by how long we can preserve uncompressed human signal in a self‑referential system. Once that signal is gone, we lose variation, and without variation, intelligence stops compounding.
I’d be interested to hear other perspectives, especially from people building or training models today.