We Didn’t Just Train AI on the Internet. We Started Training It on Itself.

Published: (May 28, 2026 at 04:19 PM EDT)
4 min read
Source: Dev.to

Optimizing for Compute

We’ve optimized for compute as if it were the main constraint:

  • GPUs
  • Clusters
  • Parallelism
  • Faster training runs

But a less visible constraint is emerging: we are running out of high‑quality human data. Worse, we are replacing it with something fundamentally different: synthetic content generated by the very models we are training.

The Lost Human Internet

Early foundation models benefited from a mostly human internet:

  • Stack Overflow answers written under pressure at 2 AM
  • Reddit threads full of disagreement and correction
  • GitHub repos with half‑documented trade‑offs
  • Research papers with actual uncertainty baked in
  • Forums where people argued, failed, and refined ideas

This wasn’t “data” in the traditional sense; it was compressed human reasoning under constraint, chaotic in a useful way.

The Rise of Synthetic Content

Fast forward to now. A large and growing portion of the web consists of:

  • AI‑written blog posts
  • SEO pages generated at scale
  • Code snippets rewritten by multiple LLMs
  • Summaries of summaries of summaries
  • Content optimized for ranking systems, not humans

Individually, none of this looks dangerous. Collectively, it creates something new: a dataset increasingly shaped by model behavior, not human behavior.

The Recursive Training Loop

We are entering a recursive training loop:

Human data → Model training → AI‑generated content → New training data → …

Each cycle slightly reduces:

  • Variance
  • Originality
  • Contradiction density
  • “Weird” human edge cases

and increases:

  • Pattern repetition
  • Stylistic convergence
  • Safe, average reasoning
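The variance loss in this loop can be illustrated with a deliberately tiny toy model. The sketch below is not how any real LLM is trained. It assumes the "model" is just a Gaussian fitted to its training data, and that each new generation is trained only on samples drawn from the previous fit. The function name `recursive_fit` and all parameter values are hypothetical choices for illustration.

```python
import random
import statistics


def recursive_fit(generations=500, n_samples=20, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the standard deviation of the data at every generation.
    """
    rng = random.Random(seed)
    # Generation 0: stand-in for "human" data, assumed here to be N(0, 1).
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    stds = [statistics.pstdev(data)]
    for _ in range(generations):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        # The next generation sees only output sampled from this model.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        stds.append(statistics.pstdev(data))
    return stds


stds = recursive_fit()
print(f"std at generation 0:   {stds[0]:.3f}")
print(f"std at generation {len(stds) - 1}: {stds[-1]:.3e}")
```

With the population estimator used here, the expected variance shrinks by a factor of (n - 1) / n per generation, on top of random drift, so the spread decays toward zero when no fresh data enters the loop. The small sample size is chosen to make the effect visible quickly; larger samples slow the decay but do not remove it in this toy setup.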

Consequences for Model Scaling

The assumption that more compute means better intelligence ignores distribution collapse, the gradual narrowing of what the training data contains. If the dataset slowly shifts toward:

  • Repetition
  • Templated reasoning
  • Averaged explanations
  • Low‑information content

then scaling merely yields faster convergence to the same middle‑of‑the‑road answer. The result is not deeper intelligence, just more confident imitation.
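A second toy sketch shows how rare content drops out of a dataset that keeps being regenerated from itself. It assumes a corpus of abstract "ideas" with Zipf‑like frequencies, and a generator that can only reproduce ideas in proportion to how often it has seen them. The names `entropy_bits` and `resample_generations`, and all sizes, are hypothetical. It says nothing quantitative about real training pipelines.

```python
import math
import random
from collections import Counter


def entropy_bits(counts):
    """Shannon entropy (in bits) of the empirical distribution in `counts`."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def resample_generations(generations=30, corpus_size=200, vocab_size=50, seed=0):
    """Regenerate a corpus from its own empirical distribution, repeatedly.

    Returns a list of (distinct_ideas, entropy_bits) pairs, one per generation.
    """
    rng = random.Random(seed)
    # Generation 0: a few common ideas plus a long tail of rare ones (assumed).
    vocab = list(range(vocab_size))
    weights = [1.0 / (rank + 1) for rank in vocab]
    corpus = rng.choices(vocab, weights=weights, k=corpus_size)

    history = []
    for _ in range(generations + 1):
        counts = Counter(corpus)
        history.append((len(counts), entropy_bits(counts)))
        # The next corpus is sampled only from what the current one contains,
        # so an idea that reaches zero count can never come back.
        items = list(counts)
        corpus = rng.choices(items, weights=[counts[i] for i in items], k=corpus_size)
    return history


history = resample_generations()
for generation, (distinct, bits) in enumerate(history):
    if generation % 10 == 0:
        print(f"generation {generation:2d}: {distinct:2d} distinct ideas, {bits:.2f} bits")
```

The number of distinct ideas can only stay flat or fall from one generation to the next, because nothing outside the current corpus is ever sampled. Making the corpus larger slows the loss of the tail in this sketch, but only an outside source of new data can restore what has already disappeared.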

If you’ve used multiple LLMs recently, you’ve probably felt it: they are converging not in capability, but in voice.

  • Same structured bullet reasoning
  • Same “balanced” tone
  • Same careful disclaimers
  • Same predictable framing patterns
  • Same safe explanatory style

Industry Response

Major AI labs are quietly responding in similar ways:

  • Licensing publisher archives
  • Paying for forum and community data
  • Locking down Reddit‑scale conversations
  • Building proprietary human datasets

High‑quality human‑generated data has become infrastructure, and infrastructure determines ceilings more than model size.

A Subtle Failure Mode

People often ask, “Will AI become too powerful?” The more realistic, subtler failure mode is:

AI systems becoming increasingly self‑referential, trained on echoes of their own outputs.

When that happens, we lose:

  • Edge‑case reasoning
  • Novelty in thought
  • Contradiction signals
  • Messy human intuition
  • Unexpected leaps

These ingredients produced the breakthroughs in the first place.

Diverging Internet Layers

We are likely splitting into two internet layers:

  1. A human layer: expensive, curated, licensed, hard‑to‑replicate
  2. A synthetic layer: cheap, scalable, increasingly self‑referential

The gap between these layers will define model quality more than parameter count ever will.

Updating the Narrative

We often say, “AI is trained on the internet.” That’s outdated. A more precise version is:

“AI is now being trained on the internet after it has been shaped by earlier versions of AI.”

This single shift changes the entire system dynamics. The internet didn’t just train AI; it gave AI structure, tone, and reasoning patterns. Now AI is feeding back into that same system.

Outlook

We may be entering a phase where intelligence improvement is limited not by compute, but by how long we can preserve uncompressed human signal in a self‑referential system. Once that signal is gone, we lose variation, and without variation, intelligence stops compounding.

If this resonates, I originally wrote a short‑form version of this idea here:

👉

I’d be interested to hear other perspectives, especially from people building or training models today.
