robots.txt is a sign, not a fence: 8 technical vectors through which AI still reads your website

Published: (March 23, 2026 at 03:54 AM EDT)
5 min read
Source: Dev.to

Introduction

You configure robots.txt to block every known bot:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /

You also enable Cloudflare Bot Management, set up Akamai, and maybe even a server‑side paywall. Yet when you query ChatGPT about your product, it still cites your website as a source.

I work on GEO (Generative Engine Optimization) projects, auditing how large language models (LLMs) represent brands. Across thousands of prompt‑response pairs we consistently find that 10–20 % of LLM answers cite the brand’s own website—even when every known bot is blocked.

Below are the 8 technical vectors we documented, with academic sources and industry data.

1. Historical Web Archives (Common Crawl)

  • Scale: 9.5+ petabytes, 300+ billion documents.
  • Usage: roughly two‑thirds of the 47 LLMs published between 2019 and 2023 use Common Crawl as training data (GPT‑3, LLaMA, T5, RedPajama, etc.).
  • Google’s C4 dataset: 750 GB filtered from Common Crawl.

Source: ACM FAccT 2024 – “A Critical Analysis of Common Crawl”.

Key point – Blocking crawlers today does not retroactively remove content already captured. Those snapshots are permanent, public resources.

JavaScript Paywalls & Common Crawl

Common Crawl does not execute JavaScript. If your paywall depends on client‑side JS, the crawler still captures the full HTML.

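// Client-side paywall: the server has already sent the full article HTML.
// This only hides it in browsers that execute JavaScript.
document.addEventListener('DOMContentLoaded', () => {
  showPaywall();
});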

Alex Reisner documented this for The Atlantic (Nov 2025): Common Crawl was capturing full articles from NYT, WSJ, The Economist, and The Atlantic itself.
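
To check whether a paywall is enforced only on the client, you can look at the raw HTML that a client without JavaScript would receive. The sketch below is a rough illustration, not a description of Common Crawl's internals: `visibleWordCount` and `looksClientSideOnly` are hypothetical helper names, and the 300‑word threshold is an arbitrary assumption.

```javascript
// Rough tag stripper: drops <script>/<style> bodies, then all remaining tags.
// Good enough for a sanity check; it is not a real HTML parser.
function visibleWordCount(rawHtml) {
  const text = rawHtml
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ");
  return text.split(/\s+/).filter(Boolean).length;
}

// Heuristic (assumption): if the un-rendered response already carries at least
// `minWords` words, the article body was shipped before any paywall script ran.
function looksClientSideOnly(rawHtml, minWords = 300) {
  return visibleWordCount(rawHtml) >= minWords;
}

// Hypothetical responses: one ships the full article plus a JS paywall,
// the other ships only a teaser rendered on the server.
const fullArticle =
  "<html><body><article>" + "word ".repeat(500) +
  "</article><script>showPaywall();</script></body></html>";
const teaserOnly =
  "<html><body><p>Subscribe to read this article.</p></body></html>";

console.log(looksClientSideOnly(fullArticle));
console.log(looksClientSideOnly(teaserOnly));
```

If the first kind of response is what your server returns to an anonymous request, any crawler that skips JavaScript receives the whole article.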

2. Bot Identity Spoofing

Some AI bots change their user‑agent or IP when blocked.

  • Cloudflare (Aug 2025) reported that Perplexity sent:

    # Declared user-agent
    PerplexityBot/1.0
    
    # What they actually sent
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0

  • The same report describes rotation across ASNs to evade IP‑based blocking.

  • The evasion ecosystem includes FlareSolverr (Selenium + undetected‑chromedriver), Scrapfly (94–98 % bypass rates), and residential proxy rotation.
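
Because the user‑agent header is self‑declared, a common countermeasure is to cross‑check it against the IP ranges an operator publishes for its bots. The following is a minimal sketch under stated assumptions: `DECLARED_BOT_RANGES` is a hypothetical table, and `192.0.2.0/24` is a documentation‑only block (RFC 5737), not a real crawler range. It handles IPv4 only.

```javascript
// Convert a dotted IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  return ip
    .split(".")
    .reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// True if `ip` falls inside the CIDR block, e.g. "192.0.2.0/24".
function inCidr(ip, cidr) {
  const [base, bitsStr] = cidr.split("/");
  const bits = Number(bitsStr);
  // A shift by 32 is a no-op in JavaScript, so /0 is handled separately.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// HYPOTHETICAL data: replace with ranges the bot operator actually publishes.
const DECLARED_BOT_RANGES = { PerplexityBot: ["192.0.2.0/24"] };

function classifyRequest(userAgent, ip, ranges = DECLARED_BOT_RANGES) {
  for (const [bot, cidrs] of Object.entries(ranges)) {
    const claimsBot = userAgent.includes(bot);
    const fromBotRange = cidrs.some((cidr) => inCidr(ip, cidr));
    if (claimsBot && fromBotRange) return "verified-bot";
    if (claimsBot && !fromBotRange) return "spoofed-bot-ua";
    if (!claimsBot && fromBotRange) return "undeclared-from-bot-range";
  }
  return "unknown";
}

console.log(classifyRequest("PerplexityBot/1.0", "192.0.2.10"));
console.log(classifyRequest("Mozilla/5.0 (Macintosh) Chrome/120.0.0.0", "192.0.2.10"));
```

A check like this only catches browser‑like requests that still come from known ranges. Traffic routed through other ASNs or residential proxies lands in `"unknown"`, which is why this vector is hard to close.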

3. Syndication Channels Bypass robots.txt

Once your content leaves your domain, robots.txt no longer applies.

Original domain (robots.txt: Disallow)
  → RSS feed (no robots.txt)
  → Apple News (different domain)
  → Email newsletter (archived on web)
  → Cross‑posted to social (scraped by bots)
  → API aggregators (reformatted downstream)

Each channel creates a copy outside your control.

Internet Archive

  • 1+ trillion pages, 99+ petabytes.
  • web.archive.org is domain #187 in Google’s C4 dataset.
  • By Feb 2026, publishers such as The Guardian and NYT had begun blocking the Wayback Machine over AI concerns (Harvard’s WARC‑GPT can ingest WARC archives directly into RAG pipelines).

4. Real‑Time Fetching by Modern LLMs

| Bot | Growth 2024–2025 | Mechanism |
|---|---|---|
| ChatGPT‑User | +2,825 % | Fetch on user “search the web” |
| PerplexityBot | +157,490 % | Fetch on every query |
| Meta‑ExternalFetcher | New in 2024 | Meta AI features |

The operators of these bots argue that each fetch is user‑initiated rather than autonomous crawling, and is therefore exempt from robots.txt.

  • Cloudflare reported Anthropic’s bots have crawl‑to‑refer ratios of 38,000:1 to 70,000:1.
  • Sources: Cloudflare Blog 2025; OpenAI Crawlers Overview.
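
A crawl‑to‑refer ratio can be approximated from your own access logs by dividing the number of requests made by a platform's bot by the number of visits that platform refers back. The sketch below assumes log entries have already been parsed into `{ userAgent, referer }` objects; the `ClaudeBot` token, the `claude.ai` referrer host, and the sample entries are illustrative assumptions, and this is not Cloudflare's exact methodology.

```javascript
// Ratio of bot fetches to human visits referred by the same platform.
function crawlToReferRatio(entries, botToken, referrerHost) {
  let crawls = 0;
  let referrals = 0;
  for (const { userAgent = "", referer = "" } of entries) {
    if (userAgent.includes(botToken)) {
      crawls += 1;
    } else if (referer) {
      let host = "";
      try {
        host = new URL(referer).hostname;
      } catch {
        // Malformed Referer header: treat as no referral.
      }
      if (host === referrerHost || host.endsWith("." + referrerHost)) {
        referrals += 1;
      }
    }
  }
  // No referrals at all means the ratio is unbounded.
  return referrals === 0 ? Infinity : crawls / referrals;
}

// Hypothetical parsed log entries.
const sampleLog = [
  { userAgent: "ClaudeBot/1.0", referer: "" },
  { userAgent: "ClaudeBot/1.0", referer: "" },
  { userAgent: "ClaudeBot/1.0", referer: "" },
  { userAgent: "ClaudeBot/1.0", referer: "" },
  { userAgent: "Mozilla/5.0", referer: "https://claude.ai/chat/abc" },
  { userAgent: "Mozilla/5.0", referer: "https://example.com/" },
  { userAgent: "Mozilla/5.0", referer: "not a url" },
];

console.log(crawlToReferRatio(sampleLog, "ClaudeBot", "claude.ai"));
```

Spoofed user agents are invisible to this count, so the result is a lower bound on crawl activity.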

5. Content Farms & Rewrites

Human or AI‑operated farms copy and rewrite your articles on unrestricted domains:

  1. Scrape the original article.
  2. Rewrite to avoid plagiarism detection.
  3. Publish on a domain with no robots.txt restrictions.
  4. AI crawlers index the rewrite.
  5. LLMs absorb the rewritten version.
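
One rough way to spot lightly rewritten copies of your own articles is word‑shingle overlap. The sketch below is only an illustration: `shingles` and `jaccard` are hypothetical helper names, the shingle size of 3 is an arbitrary assumption, and a heavy paraphrase (step 2 above) can push the score close to zero.

```javascript
// Build the set of k-word shingles from a text, ignoring case and punctuation.
function shingles(text, k = 3) {
  const words = text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ")
    .split(/\s+/)
    .filter(Boolean);
  const out = new Set();
  for (let i = 0; i + k <= words.length; i++) {
    out.add(words.slice(i, i + k).join(" "));
  }
  return out;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 0 for two empty sets.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  for (const s of a) if (b.has(s)) shared += 1;
  return shared / (a.size + b.size - shared);
}

// Hypothetical original sentence and a light rewrite of it.
const original = "robots txt is a sign not a fence for crawlers";
const rewrite = "robots txt is a sign not a wall for scrapers";

console.log(jaccard(shingles(original), shingles(rewrite)).toFixed(3));
```

A score well above what unrelated pages produce is a hint worth a manual look, not proof of copying.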

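In Bartz v. Anthropic PBC (2025), the court held that training an LLM on lawfully acquired books was fair use, while acquiring and keeping copies from “pirate sites” was not. How that reasoning applies to rewritten content on third‑party domains remains untested.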

6. Bots Ignoring robots.txt

  • 12.9 % of bots ignore robots.txt entirely (up from 3.3 %). — Paul Calvano, Aug 2025
  • Duke University (2025): “Several categories of AI‑related crawlers never request robots.txt.”
  • Kim & Bock (ACM IMC 2025): Scrapers are less likely to comply with more restrictive directives.
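
You can look for the same pattern in your own logs by listing clients that request pages without ever fetching `/robots.txt`. This is a minimal sketch assuming entries are already parsed into `{ ip, userAgent, path }` objects; the sample data and the `clientsSkippingRobots` name are hypothetical. Ordinary browsers never request robots.txt either, so the result is only meaningful for traffic you have already classified as automated.

```javascript
// Return client keys ("ip userAgent") that fetched content but never robots.txt.
function clientsSkippingRobots(entries) {
  const seen = new Map(); // client key -> { requestedRobots, otherRequests }
  for (const { ip, userAgent, path } of entries) {
    const key = `${ip} ${userAgent}`;
    const rec = seen.get(key) ?? { requestedRobots: false, otherRequests: 0 };
    if (path === "/robots.txt") {
      rec.requestedRobots = true;
    } else {
      rec.otherRequests += 1;
    }
    seen.set(key, rec);
  }
  return [...seen]
    .filter(([, rec]) => !rec.requestedRobots && rec.otherRequests > 0)
    .map(([key]) => key);
}

// Hypothetical parsed log entries (documentation-only IP addresses).
const botLog = [
  { ip: "192.0.2.1", userAgent: "PoliteBot/1.0", path: "/robots.txt" },
  { ip: "192.0.2.1", userAgent: "PoliteBot/1.0", path: "/pricing" },
  { ip: "198.51.100.9", userAgent: "GreedyBot/2.0", path: "/pricing" },
  { ip: "198.51.100.9", userAgent: "GreedyBot/2.0", path: "/blog/post-1" },
];

console.log(clientsSkippingRobots(botLog));
```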

In Ziff Davis v. OpenAI (2025), the judge described robots.txt as “more like a sign than a fence”—not a technological measure that “effectively controls access” under the DMCA.

7. Metrics Overview

| Metric | Value | Source |
|---|---|---|
| Bots ignoring robots.txt | 12.9 % | Paul Calvano, 2025 |
| Top 10K sites with AI bot rules | Only 14 % | Market analysis 2025 |
| Sites with any robots.txt | 94 % (12.2 M sites) | Global study 2025 |

8. Mitigation Strategies

  • Defensive measures (e.g., stricter bot management) reduce direct crawling by 40–60 % for compliant bots, but they cannot affect historical data, syndicated copies, or content‑farm rewrites.
  • Offensive approach: control the narrative rather than trying to hide.

At 498 Advance we built:

  • GEOdoctor – technical auditing of brand visibility in LLMs.
  • S.A.M. (Semantic Alignment Machine) – content alignment across owned media, UGC platforms (social GEO), and authority domains.

Full analysis with all academic sources:

Conclusion

Blocking everything with robots.txt and bot‑management tools is no longer sufficient. Historical archives, real‑time fetches, syndication, and content farms ensure that your website’s content can still appear in LLM outputs. The most effective strategy is to manage and align your narrative across all channels, rather than relying on a “sign” to keep bots out.

Have you encountered this paradox—blocking everything yet still appearing in LLM outputs? Feel free to share your observations.
