Why AI Is Training on Its Own Garbage (and How to Fix It)
Source: Towards Data Science
The Data Dilemma in AI Training
If you’ve been using LLMs or AI agents for a while, you’ve probably wondered how these tools will be trained in the near future. A common concern is that we might have already exhausted the high‑quality, human‑generated data needed to train ever‑larger models.
The “Model Collapse” Problem
- Continuous data growth: New content is added to the web every day.
- AI‑generated noise: An increasing share of that new content is itself generated by AI.
- Self‑reinforcement: Training on public web data eventually means training on the outputs of previous models.
- Model Collapse: Researchers call this feedback loop model collapse: each generation of models learns from the mistakes of its predecessors until the system degrades into nonsense. (A toy simulation below illustrates the effect.)
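To make the feedback loop concrete, here is a toy simulation of my own (not an experiment from any paper): a "model" that simply fits a Gaussian to its training data, where each generation trains only on samples drawn from the previous generation's fit.

```python
import numpy as np

# Toy illustration of model collapse: each "generation" fits a Gaussian
# to data sampled from the previous generation's model. Estimation error
# compounds, and the distribution's tails (the rare, interesting cases)
# are the first to disappear.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # generation 0: "human" data

for generation in range(1, 31):
    mu, sigma = data.mean(), data.std()     # "train" on the current data
    data = rng.normal(mu, sigma, size=200)  # next gen sees only model output
    if generation % 10 == 0:
        print(f"gen {generation:2d}: estimated std = {sigma:.3f}")
# In the long run the estimated spread drifts and collapses: the model
# gradually forgets the tails of the original distribution.
```

With only 200 samples per generation the drift is visible within a few dozen iterations; the same mechanism, at web scale, is what the model-collapse literature warns about.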
A Different Perspective
What if we aren’t actually running out of data, but simply looking in the wrong place?
In the rest of this article, I'll break down the key insights from a recent paper on "protected pipelines," which proposes alternative data sources and strategies to keep AI training sustainable.
The Web We Already Use and the Web That Matters
Most of us consider the web a single source of information, but in reality there are at least two distinct layers.
Surface Web
The indexed, public portion of the internet—think Reddit, Wikipedia, news sites, and other pages that search engines can crawl. This is the data we have been scraping and over‑using for years to train today’s mainstream AI models.
Deep Web
Not to be confused with the "dark web" or illegal content, the Deep Web consists of everything behind a login or firewall: any online content that isn't publicly indexed. Examples include:
- Hospital patient portals
- Bank internal dashboards
- Enterprise document archives
- Private databases
- Years of email stored behind authentication screens
These are ordinary, often boring, but incredibly valuable data sources.
Why the Deep Web Matters
- Size: Studies suggest the Deep Web is orders of magnitude larger than the Surface Web.
- Quality: Content is typically cleaner, authenticated, and organized by people who care about its accuracy.
- Reliability: The Surface Web can be noisy, full of misinformation, SEO‑optimized, and increasingly designed to mislead or poison AI models. By contrast, Deep Web data (e.g., medical records, verified financial documents, internal databases) offers far higher fidelity.
The Problem
The biggest obstacle is privacy. Extracting large volumes of sensitive data, such as medical records, without addressing the legal and ethical constraints would have catastrophic consequences.
The PROPS Framework
Protected Pipelines (PROPS) is a privacy‑preserving architecture introduced by Ari Juels (Cornell Tech), Farinaz Koushanfar (UCSD), and Laurence Moroney (former Google AI Lead). It bridges sensitive data and the AI models that need it without ever exposing the raw data.
How PROPS Works
- Permission – The data owner logs into their own portal (e.g., a health‑record system) and explicitly authorizes a specific use of their data.
- Privacy‑Preserving Oracle – The oracle acts as a trusted middleman:
- It accesses the owner’s private source, verifies that the data is authentic, and then provides a cryptographic proof to the AI system.
- The AI never sees the raw data; it only receives a statement such as “I have seen the original documents and attest they are authentic.”
- Existing implementations include DECO, a protocol that lets users prove statements about data they retrieved over a TLS channel without revealing the data itself.
- Secure Enclave – Training occurs inside a hardware‑based trusted execution environment (TEE):
- The AI model and the private data are loaded into the enclave, which is cryptographically sealed.
- No human, developer, or external process can inspect the data while training is in progress.
- Result – After training, only the updated model weights (the learned knowledge) exit the enclave. The raw data remains locked inside the enclave and is securely destroyed when the session ends. (A code sketch of the full flow follows below.)
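To make the four steps concrete, here is a minimal Python sketch of the flow. Every class and function name here (Attestation, oracle_attest, train_in_enclave, and so on) is hypothetical; a real deployment would use actual DECO proofs and TEE APIs rather than these stand-ins.

```python
from dataclasses import dataclass

# A minimal, hypothetical sketch of the PROPS flow described above.
# None of these classes or functions come from a real library; they
# simply mirror the four steps: permission, oracle, enclave, result.

@dataclass
class Attestation:
    source: str
    claim: str
    signature: bytes  # in a real system: a DECO-style or TEE-signed proof


def request_permission(owner_portal: str, purpose: str) -> bool:
    """Step 1: the data owner explicitly authorizes a specific use."""
    print(f"Owner approved use of {owner_portal} for: {purpose}")
    return True


def oracle_attest(owner_portal: str) -> Attestation:
    """Step 2: the oracle reads the private source and emits a proof,
    never the underlying records."""
    return Attestation(
        source=owner_portal,
        claim="I have seen the original documents and attest they are authentic.",
        signature=b"placeholder-signature",
    )


def train_in_enclave(proof: Attestation, weights: list) -> list:
    """Steps 3 and 4: inside the sealed TEE, the model trains on the
    verified private data; only updated weights leave the enclave."""
    assert proof.signature, "refuse to train on unattested data"
    return [w + 0.01 for w in weights]  # stand-in for a real training step


if request_permission("health-record-portal", "fine-tune a medical model"):
    proof = oracle_attest("health-record-portal")
    new_weights = train_in_enclave(proof, weights=[0.0, 0.0, 0.0])
    print("Updated weights leave the enclave:", new_weights)
```

The key design choice is that the AI side of the pipeline handles only the Attestation object, never the records behind it; trust flows through the proof rather than through data transfer.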
Benefits
- Data never leaves the owner’s domain – the AI receives only verifiable proofs, not the data itself.
- Fine‑grained consent – users know exactly what they are permitting and can be compensated proportionally to the value of their contribution.
- Stronger trust – the relationship between data owners and AI systems shifts from “hand‑over” to “verified use.”
The PROPS framework thus offers a practical, cryptographically sound solution to the data‑availability challenges that modern AI models face.
Why Not Just Use Synthetic Data?
Some might ask: “Why bother with this complex setup when we can simply generate synthetic data?”
The answer is that synthetic data is a diversity killer. By definition, synthetic‑data generators reinforce the middle of the bell curve. If you have a rare medical condition that affects only 0.01% of the population, a synthetic data generator will likely smooth it out as "noise."
Models trained on synthetic data become progressively worse at handling outliers. PROPS solves this by creating a secure way for real people with rare conditions or unique backgrounds to opt‑in. It turns data sharing from a privacy risk into a data marketplace where valuable data receives the compensation it deserves.
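A quick way to see the smoothing effect is to fit a single Gaussian to a population containing a tiny subgroup and then sample from the fit. This is a deliberately naive generator of my own construction, not a claim about how production synthetic-data pipelines work:

```python
import numpy as np

rng = np.random.default_rng(42)
common = rng.normal(0.0, 1.0, size=99_990)  # 99.99% of the population
rare = rng.normal(8.0, 0.5, size=10)        # rare condition: 0.01%
real = np.concatenate([common, rare])

# "Generator": a single Gaussian fit to the pooled data.
synthetic = rng.normal(real.mean(), real.std(), size=100_000)

print("rare cases (x > 6) in real data:     ", int((real > 6).sum()))
print("rare cases (x > 6) in synthetic data:", int((synthetic > 6).sum()))
# The fit absorbs the rare subgroup into the bulk of the distribution,
# so the synthetic sample almost certainly contains none of them.
```

Ten real outliers survive in the original data; the fitted generator, which only sees a mean and a standard deviation, reproduces essentially zero.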
Inference Matters Too
Most discussions focus on training, but PROPS also has an interesting application on the inference side.
Example: Loan Decision Workflow
- Authorization – You authorize a Loan Decision Model (LDM) to talk directly to your bank.
- Verification – The bank confirms your balance via a privacy‑preserving oracle.
- Decision – The LDM makes a decision.
- Result – The lender receives a verified “Yes” or “No” without ever seeing your private documents.
This eliminates the risk of data leaks and makes it nearly impossible for fraudsters to submit photoshopped documents.
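Here is a hypothetical sketch of that workflow. bank_oracle_verify is a stand-in for a DECO-style, privacy-preserving proof over a TLS session with the bank; no real banking or oracle API is used here.

```python
def bank_oracle_verify(account_id: str, min_balance: float) -> bool:
    # Returns an attested "balance >= min_balance" statement.
    # The actual balance and transaction history are never disclosed.
    return True  # stubbed: assume the cryptographic proof checks out


def loan_decision(account_id: str) -> str:
    if bank_oracle_verify(account_id, min_balance=5_000.0):
        return "Yes"  # the lender sees only this verified outcome
    return "No"


print(loan_decision("acct-123"))  # -> "Yes", with no documents exchanged
```

The lender's system never ingests a bank statement it would have to store, secure, or trust; it consumes a boolean backed by a proof.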
What’s Stopping This From Happening in 2026?
It comes down to scale and infrastructure.
- The most robust version of PROPS requires training inside a hardware‑backed secure enclave (e.g., Intel SGX or NVIDIA’s H100 TEEs).
- These enclaves work well at a small scale, but scaling them to the massive GPU clusters needed for frontier LLMs is still an open engineering problem.
- Coordinating large clusters in perfect, encrypted sync is a non‑trivial challenge.
Researchers are clear: PROPS isn’t a finished product yet—it’s a persuasive proof‑of‑concept. However, a lighter‑weight version is deployable today. Even without full hardware guarantees, you can build systems that give users meaningful assurance, which is already an improvement over asking someone to email you a PDF.
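As one illustration of such a lighter-weight system (my own example, not a design from the paper), a data holder can sign statements about their data with a key provisioned out of band, so the recipient can at least verify integrity and origin instead of trusting an emailed PDF:

```python
import hashlib
import hmac

# Lightweight, non-TEE assurance: an HMAC tag over a statement about the
# data. Weaker than an enclave, but verifiable rather than trust-me.
SHARED_KEY = b"provisioned-out-of-band"


def attest(statement: bytes) -> bytes:
    return hmac.new(SHARED_KEY, statement, hashlib.sha256).digest()


def verify(statement: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(attest(statement), tag)


tag = attest(b"balance >= 5000 as of 2025-01-01")
print(verify(b"balance >= 5000 as of 2025-01-01", tag))  # True
print(verify(b"balance >= 9000 as of 2025-01-01", tag))  # False: tampered
```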
My Final Thoughts
PROPS isn’t a brand‑new technology; it’s a new application of existing tools. Privacy‑preserving oracles have been used in the blockchain and Web3 space (e.g., Chainlink) for years. The insight is recognizing that the same tools can help solve the AI data crisis.
The “data crisis” isn’t a lack of information—it’s a lack of trust. We have more than enough data to build the next generation of AI, but it’s locked behind the doors of the Deep Web. The snake doesn’t have to eat its tail; it just needs to find a better garden.
Connect with Me
- LinkedIn: Sabrine Bendimerad
- Medium: @sabrine.bendimerad1
- Instagram: tinyurl.com/datailearn