robots.txt is a sign, not a fence: 8 technical vectors through which AI still reads your website

Published: 1 month ago (March 23, 2026 at 03:54 AM EDT)

5 min read

Source: Dev.to

Introduction

You configure robots.txt to block every known bot:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /

You also enable Cloudflare Bot Management, set up Akamai, and maybe even a server‑side paywall. Yet when you query ChatGPT about your product, it still cites your website as a source.

I work on GEO (Generative Engine Optimization) projects, auditing how large language models (LLMs) represent brands. Across thousands of prompt‑response pairs we consistently find that 10–20 % of LLM answers cite the brand’s own website—even when every known bot is blocked.

Below are the 8 technical vectors we documented, with academic sources and industry data.

1. Historical Web Archives (Common Crawl)

Scale: 9.5+ petabytes, 300+ billion documents.
Usage: roughly two-thirds of the 47 LLMs published between 2019 and 2023 use Common Crawl as training data (GPT‑3, LLaMA, T5, RedPajama, etc.).
Google’s C4 dataset: 750 GB filtered from Common Crawl.

Source: ACM FAccT 2024 – “A Critical Analysis of Common Crawl”.

Key point – Blocking crawlers today does not retroactively remove content already captured. Those snapshots are permanent, public resources.

JavaScript Paywalls & Common Crawl

Common Crawl does not execute JavaScript. If your paywall only hides content with client‑side JS, the full article is already in the HTML response, and that response is what the crawler stores.

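// Client-side paywall: the article is already in the DOM when this runs.
// A crawler that does not execute JavaScript never reaches showPaywall().
document.addEventListener('DOMContentLoaded', () => {
  showPaywall();
});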

Alex Reisner documented this for The Atlantic (Nov 2025): Common Crawl was capturing full articles from NYT, WSJ, The Economist, and The Atlantic itself.
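
To make the JavaScript paywall point concrete, here is a minimal sketch of what a parser that does not execute JavaScript sees. The page markup and the `extract_text` helper are hypothetical, and this is not how Common Crawl itself is implemented; it only illustrates that text shipped in the HTML is readable without running the paywall script.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a script or style element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the article ships in the HTML, the paywall is JS only.
PAGE = """
<html><body>
<article><p>Full premium article text.</p></article>
<script>
document.addEventListener('DOMContentLoaded', () => { showPaywall(); });
</script>
</body></html>
"""

text = extract_text(PAGE)
```

The script is never run, so the overlay never appears and the article text is extracted intact. A server-side paywall avoids this because the text is not in the response at all.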

2. Bot Identity Spoofing

Some AI bots change their user‑agent or IP when blocked.

Cloudflare (Aug 2025) reported that Perplexity declared one user‑agent but, when blocked, sent a generic browser string instead:

# Declared user-agent
PerplexityBot/1.0

# What they actually sent
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0

They also rotate ASNs to evade IP‑based blocking.
The evasion ecosystem includes FlareSolverr (Selenium + undetected‑chromedriver), Scrapfly (94–98 % bypass rates), and residential proxy rotation.
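
A rough sketch of why spoofing defeats user‑agent and IP blocklists, under stated assumptions: `ExampleBot`, its address range (a documentation‑only block), and `classify_request` are all hypothetical and stand in for whatever a real operator publishes.

```python
import ipaddress

# Hypothetical published ranges for a declared bot (documentation-only addresses).
DECLARED_BOT_RANGES = {
    "ExampleBot": [ipaddress.ip_network("192.0.2.0/24")],
}


def classify_request(user_agent: str, remote_ip: str) -> str:
    """Return 'verified-bot', 'spoofed-bot', or 'unknown'."""
    ip = ipaddress.ip_address(remote_ip)
    for bot_name, networks in DECLARED_BOT_RANGES.items():
        if bot_name.lower() in user_agent.lower():
            # The request claims to be the bot: check it against published ranges.
            if any(ip in net for net in networks):
                return "verified-bot"
            return "spoofed-bot"
    # No bot name in the user-agent: these two signals cannot tell it from a browser.
    return "unknown"
```

A request that sends a generic browser string from an unlisted address falls into `unknown`, which is the same bucket as an ordinary visitor. Separating the two requires behavioral or fingerprint signals that this sketch does not attempt.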

3. Syndication Channels Bypass `robots.txt`

Once your content leaves your domain, robots.txt no longer applies.

Original domain (robots.txt: Disallow)
  → RSS feed (no robots.txt)
  → Apple News (different domain)
  → Email newsletter (archived on web)
  → Cross‑posted to social (scraped by bots)
  → API aggregators (reformatted downstream)

Each channel creates a copy outside your control.
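
The scoping rule behind this can be sketched in a few lines: a crawler looks up robots.txt on the host it is fetching from, so each copy is governed by a different file. The URLs below are hypothetical examples and `robots_url_for` is an illustrative helper.

```python
from urllib.parse import urlsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that applies to page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Hypothetical locations of the same article after syndication.
copies = [
    "https://example.com/blog/post",
    "https://aggregator.example.net/feed/item?id=1",
    "https://social.example.org/@brand/12345",
]

governing_files = {robots_url_for(url) for url in copies}
```

Only the first of the three files is yours to edit. The other two belong to whoever operates those hosts.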

Internet Archive

1+ billion pages, 99+ petabytes.
web.archive.org is domain #187 in Google’s C4 dataset.
As of Feb 2026, publishers such as The Guardian and NYT had begun blocking the Wayback Machine over AI concerns. One reason: tools such as Harvard’s WARC‑GPT can ingest WARC archives directly into RAG pipelines.

4. Real‑Time Fetching by Modern LLMs

Bot	Growth 2024–2025	Mechanism
ChatGPT‑User	+2,825 %	Fetch on user “search the web”
PerplexityBot	+157,490 %	Fetch on every query
Meta‑ExternalFetcher	New in 2024	Meta AI features

The operators of these bots argue that such fetches are user‑initiated rather than autonomous crawling, and therefore exempt from robots.txt.

Cloudflare reported Anthropic’s bots have crawl‑to‑refer ratios of 38,000:1 to 70,000:1.
Sources: Cloudflare Blog 2025; OpenAI Crawlers Overview.
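
For readers who want to translate these figures, here is a small worked sketch. The two helper functions are illustrative, and the request counts in the last line are invented to show the arithmetic, not taken from Cloudflare's data.

```python
def growth_multiplier(percent_growth: float) -> float:
    """Convert a '+X %' growth figure into an 'N times as many' multiplier."""
    return 1 + percent_growth / 100


def crawl_to_refer_ratio(pages_crawled: int, referrals: int) -> float:
    """Pages fetched by a bot for every visitor it sends back."""
    if referrals <= 0:
        raise ValueError("referrals must be positive")
    return pages_crawled / referrals


chatgpt_user = growth_multiplier(2825)       # about 29x the earlier volume
perplexity_bot = growth_multiplier(157490)   # about 1,576x the earlier volume

# Hypothetical log counts that would yield the lower bound quoted above.
example_ratio = crawl_to_refer_ratio(7_600_000, 200)
```

At a ratio of 38,000:1, a site would serve 38,000 pages to the crawler for each human visit it receives in return.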

5. Content Farms & Rewrites

Human or AI‑operated farms copy and rewrite your articles on unrestricted domains:

1. Scrape the original article.
2. Rewrite to avoid plagiarism detection.
3. Publish on a domain with no robots.txt restrictions.
4. AI crawlers index the rewrite.
5. LLMs absorb the rewritten version.

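In Bartz v. Anthropic PBC (2025), the court held that using books to train an LLM was fair use, but that downloading copies from “pirate sites” to build a permanent library was not. The ruling addressed books, not rewritten web content, so its reach over content‑farm copies is untested.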

6. Bots Ignoring `robots.txt`

12.9 % of bots ignore robots.txt entirely (up from 3.3 %). — Paul Calvano, Aug 2025
Duke University (2025): “Several categories of AI‑related crawlers never request robots.txt.”
Kim & Bock (ACM IMC 2025): Scrapers are less likely to comply with more restrictive directives.
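
The reason non‑compliance is so easy can be sketched with Python's standard `urllib.robotparser`: the check runs inside the crawler, and nothing on the server enforces it. The `will_fetch` function and its `honors_robots` flag are hypothetical, used only to model a crawler that skips the check.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """What robots.txt asks of this user-agent for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def will_fetch(robots_txt: str, user_agent: str, url: str, honors_robots: bool) -> bool:
    """Hypothetical crawler decision: the rules only matter if the crawler reads them."""
    if not honors_robots:
        return True  # never consults robots.txt, so the rules have no effect
    return is_allowed(robots_txt, user_agent, url)
```

With the rules above, `is_allowed` returns `False` for every user‑agent, yet `will_fetch(..., honors_robots=False)` still returns `True`. The file expresses a request, and only the crawler decides whether to honor it.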

Legal Perspective

In Ziff Davis v. OpenAI (2025), the judge described robots.txt as “more like a sign than a fence”—not a technological measure that “effectively controls access” under the DMCA.

7. Metrics Overview

Metric	Value	Source
Bots ignoring `robots.txt`	12.9 %	Paul Calvano, 2025
Top 10K sites with AI bot rules	Only 14 %	Market analysis 2025
Sites with any `robots.txt`	94 % (12.2 M sites)	Global study 2025

8. Mitigation Strategies

Defensive measures (e.g., stricter bot management) reduce direct crawling by 40–60 % for compliant bots, but they cannot affect historical data, syndicated copies, or content‑farm rewrites.
Offensive approach: control the narrative rather than trying to hide.
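
As an illustration of the defensive side, here is a minimal WSGI middleware sketch that refuses requests from declared AI user‑agents at the server instead of asking them to stay away. The token list mirrors the robots.txt example in the introduction; `block_declared_ai_bots`, `demo_app`, and `status_for` are hypothetical names, and commercial bot management does far more than this.

```python
BLOCKED_AGENT_TOKENS = ("GPTBot", "CCBot", "anthropic-ai", "PerplexityBot")


def block_declared_ai_bots(app):
    """Wrap a WSGI app and return 403 for user-agents that declare a blocked bot."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token.lower() in user_agent for token in BLOCKED_AGENT_TOKENS):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site."""
    body = b"article"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def status_for(user_agent: str) -> str:
    """Call the wrapped app directly and return the status line it sent."""
    captured = []
    wrapped = block_declared_ai_bots(demo_app)
    wrapped({"HTTP_USER_AGENT": user_agent}, lambda status, headers: captured.append(status))
    return captured[0]
```

A request declaring `GPTBot/1.0` gets a 403, while the same client sending a browser string gets a 200. That gap is why this kind of measure only reduces traffic from bots that identify themselves.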

At 498 Advance we built:

GEOdoctor – technical auditing of brand visibility in LLMs.
S.A.M. (Semantic Alignment Machine) – content alignment across owned media, UGC platforms (social GEO), and authority domains.

Full analysis with all academic sources:

Conclusion

Blocking everything with robots.txt and bot‑management tools is no longer sufficient. Historical archives, real‑time fetches, syndication, and content farms ensure that your website’s content can still appear in LLM outputs. The most effective strategy is to manage and align your narrative across all channels, rather than relying on a “sign” to keep bots out.

Have you encountered this paradox—blocking everything yet still appearing in LLM outputs? Feel free to share your observations.
