We Didn’t Just Train AI on the Internet. We Started Training It on Itself.

Published: May 28, 2026 at 04:19 PM EDT

Source: Dev.to

Optimizing for Compute

We’ve optimized for compute as if it were the main constraint:

GPUs
Clusters
Parallelism
Faster training runs

But a less visible constraint is emerging: we are running out of high-quality human data. Worse, we are replacing it with something fundamentally different: synthetic content generated by the very models we are training.

The Lost Human Internet

Early foundation models benefited from a mostly human internet:

Stack Overflow answers written under pressure at 2 AM
Reddit threads full of disagreement and correction
GitHub repos with half‑documented trade‑offs
Research papers with actual uncertainty baked in
Forums where people argued, failed, and refined ideas

This wasn’t “data” in the traditional sense; it was compressed human reasoning under constraint, chaotic in a useful way.

The Rise of Synthetic Content

Fast forward to now. A large and growing portion of the web consists of:

AI‑written blog posts
SEO pages generated at scale
Code snippets rewritten by multiple LLMs
Summaries of summaries of summaries
Content optimized for ranking systems, not humans

Individually, none of this looks dangerous. Collectively, it creates something new: a dataset increasingly shaped by model behavior, not human behavior.

The Recursive Training Loop

We are entering a recursive training loop:

Human data → Model training → AI‑generated content → New training data → …

Each cycle slightly reduces:

Variance
Originality
Contradiction density
“Weird” human edge cases

and increases:

Pattern repetition
Stylistic convergence
Safe, average reasoning
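
The loss of variance across cycles can be illustrated with a deliberately simple toy model. This is a sketch under strong assumptions, not a claim about how any real LLM is trained: the "model" is a single Gaussian, "training" means fitting its mean and standard deviation to a small sample, and each generation sees only the previous generation's output. The function name `simulate_recursive_fitting` and all parameter values are hypothetical, and the small sample size is chosen to exaggerate the effect.

```python
import random
import statistics


def simulate_recursive_fitting(generations=200, sample_size=10, seed=0):
    """Toy loop: fit a Gaussian to samples, resample from the fit, repeat.

    Returns the fitted standard deviation per generation. Index 0 is the
    starting "human" distribution N(0, 1), an assumption made for illustration.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: stand-in for human data
    history = [sigma]
    for _ in range(generations):
        # "Model output": a finite sample drawn from the current fit.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Retraining": refit using only that synthetic sample.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    spread = simulate_recursive_fitting()
    for gen in (0, 10, 50, 100, 200):
        print(f"generation {gen:3d}: std = {spread[gen]:.3g}")
```

Each refit underestimates the spread a little on average, and because no fresh data ever enters the loop, those small losses compound instead of averaging out.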

Consequences for Model Scaling

The assumption that more compute means better intelligence ignores distribution collapse: the gradual narrowing of the training data as model output displaces human output. If the dataset slowly shifts toward:

Repetition
Templated reasoning
Averaged explanations
Low‑information content

then scaling merely yields faster convergence to the same middle-of-the-road answer. The result is not deeper intelligence, just more confident imitation.
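
One rough way to put a number on the repetition and templating described above is a distinct-n score: the share of n-grams in a text that are unique. The sketch below is illustrative only. The choice of metric, the function name `distinct_n`, the whitespace tokenization, and both sample strings are assumptions made here, not something the argument depends on.

```python
def distinct_n(text, n=2):
    """Share of n-grams in `text` that are unique (1.0 means none repeat)."""
    tokens = text.lower().split()  # crude whitespace tokenization
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens
    return len(set(ngrams)) / len(ngrams)


# Hypothetical samples: one templated, one with no repeated phrasing.
templated = "it depends on the use case . it depends on the context . it depends on the goal ."
varied = "profile first , then cache the hot path ; the lock was never the bottleneck"

if __name__ == "__main__":
    print("templated:", round(distinct_n(templated), 2))
    print("varied:   ", round(distinct_n(varied), 2))
```

A score like this says nothing about correctness or depth. It only flags how much a corpus reuses the same surface patterns, which is one of the shifts listed above.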

If you’ve used multiple LLMs recently, you’ve probably felt it: they are converging not in capability, but in voice.

Same structured bullet reasoning
Same “balanced” tone
Same careful disclaimers
Same predictable framing patterns
Same safe explanatory style

Industry Response

Major AI labs are quietly converging on the same response:

Licensing publisher archives
Paying for forum and community data
Locking down Reddit‑scale conversations
Building proprietary human datasets

High‑quality human‑generated data has become infrastructure, and infrastructure determines ceilings more than model size.

A Subtle Failure Mode

People often ask, “Will AI become too powerful?” The more realistic, subtler failure mode is:

AI systems becoming increasingly self‑referential, trained on echoes of their own outputs.

When that happens, we lose:

Edge‑case reasoning
Novelty in thought
Contradiction signals
Messy human intuition
Unexpected leaps

These ingredients produced the breakthroughs in the first place.

Diverging Internet Layers

We are likely splitting into two internet layers:

A human layer: expensive, curated, licensed, hard to replicate
A synthetic layer: cheap, scalable, increasingly self-referential

The gap between these layers will define model quality more than parameter count ever will.

Updating the Narrative

We often say, “AI is trained on the internet.” That’s outdated. A more precise version is:

“AI is now being trained on the internet after it has been shaped by earlier versions of AI.”

This single shift changes the entire system dynamics. The internet didn’t just train AI; it gave AI structure, tone, and reasoning patterns. Now AI is feeding back into that same system.

Outlook

We may be entering a phase where intelligence improvement is limited not by compute, but by how long we can preserve uncompressed human signal in a self‑referential system. Once that signal is gone, we lose variation, and without variation, intelligence stops compounding.
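
The role of preserved human signal can be sketched with a toy experiment. Everything here is an assumption made for illustration: the "model" is a single Gaussian refit every generation, "human data" is a fixed N(0, 1) source, and `human_share` (a hypothetical parameter) is the fraction of each training batch drawn fresh from that source. It is not a model of real LLM training.

```python
import random
import statistics


def final_spread(human_share, generations=200, sample_size=10, seed=0):
    """Refit a Gaussian each generation on a mix of human and synthetic samples.

    Returns the fitted standard deviation after the last generation.
    """
    rng = random.Random(seed)
    n_human = round(human_share * sample_size)
    mu, sigma = 0.0, 1.0  # start from the "human" distribution
    for _ in range(generations):
        # Fresh draws from the fixed human source.
        human = [rng.gauss(0.0, 1.0) for _ in range(n_human)]
        # Draws from the previous generation's fitted model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_human)]
        batch = human + synthetic
        mu, sigma = statistics.fmean(batch), statistics.pstdev(batch)
    return sigma


if __name__ == "__main__":
    for share in (0.0, 0.2, 0.5):
        print(f"human share {share:.1f}: final std = {final_spread(share):.3g}")
```

In this toy setting the spread collapses when the human share is zero, while any steady supply of fresh human samples keeps it from vanishing. Whether the same holds at the scale of real training corpora is a separate question.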

I’d be interested to hear other perspectives—especially from people building or training models today.
