robots.txt is a sign, not a fence: 8 technical vectors through which AI still reads your website
Source: Dev.to
Introduction
You configure robots.txt to block every known bot:
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: *
Disallow: /
You also enable Cloudflare Bot Management, set up Akamai, and maybe even a server‑side paywall. Yet when you query ChatGPT about your product, it still cites your website as a source.
I work on GEO (Generative Engine Optimization) projects, auditing how large language models (LLMs) represent brands. Across thousands of prompt‑response pairs we consistently find that 10–20 % of LLM answers cite the brand’s own website—even when every known bot is blocked.
Below are the 8 technical vectors we documented, with academic sources and industry data.
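Before going through the vectors, it helps to confirm what the rules from the introduction actually express. The following is a minimal sketch using Python's standard `urllib.robotparser`; the `is_allowed` helper and the example.com URL are illustrative assumptions. The result only describes what a compliant crawler would do. Nothing in it enforces anything.

```python
from urllib.robotparser import RobotFileParser

# Subset of the rules from the introduction, one blank line between groups.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a *compliant* crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL, used only to exercise the rules.
    for agent in ("GPTBot", "CCBot", "SomeUnlistedBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/pricing"))
```

Every agent is refused here because of the `User-agent: *` group. A crawler that never reads the file, or reads it and ignores it, is unaffected by the answer.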
1. Historical Web Archives (Common Crawl)
- Scale: 9.5+ petabytes, 300+ billion documents.
- Usage: about two‑thirds of the 47 LLMs published between 2019 and 2023 used Common Crawl as training data (GPT‑3, LLaMA, T5, Red Pajama, etc.).
- Google’s C4 dataset: 750 GB filtered from Common Crawl.
Source: ACM FAccT 2024 – “A Critical Analysis of Common Crawl”.
Key point – Blocking crawlers today does not retroactively remove content already captured. Those snapshots are permanent, public resources.
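To see whether a domain already sits in one of those snapshots, you can query the Common Crawl URL index. The sketch below only builds the query URL and parses a response. The crawl ID is an assumption picked for illustration, the helper names are hypothetical, and the network call is left out so the snippet runs offline.

```python
import json
from urllib.parse import urlencode

# Assumption: one crawl ID chosen for illustration; the list of available
# crawls is published on index.commoncrawl.org.
CRAWL_ID = "CC-MAIN-2024-10"


def index_query_url(domain: str, crawl_id: str = CRAWL_ID) -> str:
    """Build an index query for every captured URL under a domain."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


def parse_captures(response_text: str) -> list[tuple[str, str]]:
    """Parse a newline-delimited JSON response into (timestamp, url) pairs."""
    captures = []
    for line in response_text.splitlines():
        if line.strip():
            record = json.loads(line)
            captures.append((record["timestamp"], record["url"]))
    return captures


if __name__ == "__main__":
    # Fetch this URL with any HTTP client and pass the body to parse_captures.
    print(index_query_url("example.com"))
```

Any capture returned predates whatever you put in robots.txt today, which is the point of this vector.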
JavaScript Paywalls & Common Crawl
Common Crawl does not execute JavaScript. If your paywall depends on client‑side JS, the crawler still captures the full HTML.
document.addEventListener('DOMContentLoaded', () => {
  showPaywall();
});
Alex Reisner documented this for The Atlantic (Nov 2025): Common Crawl was capturing full articles from NYT, WSJ, The Economist, and The Atlantic itself.
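The effect is easy to reproduce. The sketch below stands in for a crawler that does not run JavaScript: it reads the raw HTML with Python's standard `html.parser` and never executes the script. The page markup and the `TextExtractor` class are hypothetical, written only to illustrate the point.

```python
from html.parser import HTMLParser

# Hypothetical page: the full article ships in the HTML and a script
# covers it with a paywall only after the page loads in a browser.
PAGE = """\
<html><body>
<article><p>Full article text that subscribers pay for.</p></article>
<script>
document.addEventListener('DOMContentLoaded', () => { showPaywall(); });
</script>
</body></html>
"""


class TextExtractor(HTMLParser):
    """Collect visible text while skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the text a non-JavaScript client would see in the raw HTML."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(extract_text(PAGE))
```

The script is never run, so `showPaywall()` never fires and the article text is all that remains. A paywall enforced on the server, where the text is not sent at all, does not have this problem.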
2. Bot Identity Spoofing
Some AI bots change their user‑agent or IP when blocked.
Cloudflare (Aug 2025) reported that Perplexity sent:
# Declared user-agent
PerplexityBot/1.0
# What they actually sent
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0
They also rotate ASNs to evade IP‑based blocking.
The evasion ecosystem includes FlareSolverr (Selenium + undetected‑chromedriver), Scrapfly (94–98 % bypass rates), and residential proxy rotation.
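A short sketch shows why this works against the two most common filters. The blocklist tokens come from the robots.txt in the introduction; the IP range is a placeholder from the RFC 5737 documentation block, standing in for a crawler IP list that an operator might publish. Both helper functions are hypothetical.

```python
import ipaddress

BLOCKED_UA_TOKENS = ("GPTBot", "CCBot", "anthropic-ai", "PerplexityBot")

# Assumption: placeholder documentation range (RFC 5737), not a real
# crawler range. Substitute the list the bot operator actually publishes.
PUBLISHED_BOT_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

DECLARED = "PerplexityBot/1.0"
STEALTH = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 Chrome/120.0.0.0"
)


def blocked_by_user_agent(user_agent: str) -> bool:
    """True if a naive user-agent blocklist would reject this request."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_UA_TOKENS)


def from_published_range(ip: str) -> bool:
    """True if the client IP falls inside a published crawler range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in PUBLISHED_BOT_RANGES)


if __name__ == "__main__":
    print(blocked_by_user_agent(DECLARED))       # declared bot is caught
    print(blocked_by_user_agent(STEALTH))        # browser-like string is not
    print(from_published_range("198.51.100.7"))  # rotated IP is outside the list
```

Once the user-agent looks like a browser and the IP sits outside the published range, neither check has anything to match on. That is why bot-management vendors lean on behavioural and fingerprint signals instead.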
3. Syndication Channels Bypass robots.txt
Once your content leaves your domain, robots.txt no longer applies.
Original domain (robots.txt: Disallow)
→ RSS feed (no robots.txt)
→ Apple News (different domain)
→ Email newsletter (archived on web)
→ Cross‑posted to social (scraped by bots)
→ API aggregators (reformatted downstream)
Each channel creates a copy outside your control.
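The reason is scope: a robots.txt file only governs the scheme, host, and port it is served from. The sketch below checks that for a few copies of the same article. All URLs are hypothetical and `same_origin` is an illustrative helper, not part of any crawler.

```python
from urllib.parse import urlsplit


def same_origin(robots_url: str, content_url: str) -> bool:
    """True if content_url is served from the origin that robots_url governs."""
    a, b = urlsplit(robots_url), urlsplit(content_url)
    return (a.scheme, a.hostname, a.port) == (b.scheme, b.hostname, b.port)


ROBOTS = "https://example.com/robots.txt"

# Hypothetical copies of one article after syndication.
COPIES = [
    "https://example.com/blog/post",           # original
    "https://reader.example.net/feed/post",    # feed aggregator
    "https://news.example.org/publisher/post", # news app mirror
    "https://mail.example.io/archive/post",    # archived newsletter
]

if __name__ == "__main__":
    for url in COPIES:
        status = "governed" if same_origin(ROBOTS, url) else "outside your robots.txt"
        print(url, "->", status)
```

Only the first URL is covered. Whether the other copies are crawlable depends entirely on the robots.txt of whoever hosts them.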
Internet Archive
- 1+ billion pages, 99+ petabytes.
- web.archive.org is domain #187 in Google’s C4 dataset.
- By Feb 2026, publishers such as The Guardian and NYT had begun blocking the Wayback Machine over AI concerns (Harvard’s WARC‑GPT can ingest WARC archives directly into RAG pipelines).
4. Real‑Time Fetching by Modern LLMs
| Bot | Growth 2024–2025 | Mechanism |
|---|---|---|
| ChatGPT‑User | +2,825 % | Fetch on user “search the web” |
| PerplexityBot | +157,490 % | Fetch on every query |
| Meta‑ExternalFetcher | New in 2024 | Meta AI features |
These bots claim the fetch is user‑initiated (not autonomous crawling) to argue they are exempt from robots.txt.
- Cloudflare reported Anthropic’s bots have crawl‑to‑refer ratios of 38,000:1 to 70,000:1.
- Sources: Cloudflare Blog 2025; OpenAI Crawlers Overview.
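The figures above are simple ratios you can compute from your own logs. This is a minimal sketch, assuming you already have the user-agent strings and a referral count from analytics. The function names are hypothetical, and the tokens are the bot names from the table, written with ASCII hyphens.

```python
from collections import Counter

# User-agent tokens for the fetchers named in the table.
FETCHER_TOKENS = ("ChatGPT-User", "PerplexityBot", "Meta-ExternalFetcher")


def count_fetches(user_agents: list[str]) -> Counter:
    """Count requests per known real-time fetcher token."""
    counts: Counter = Counter()
    for ua in user_agents:
        for token in FETCHER_TOKENS:
            if token.lower() in ua.lower():
                counts[token] += 1
                break
    return counts


def crawl_to_refer_ratio(crawls: int, referrals: int) -> float:
    """Pages fetched by a bot per visitor it sends back."""
    if referrals == 0:
        return float("inf")
    return crawls / referrals


def growth_multiplier(percent_growth: float) -> float:
    """Convert a percentage increase into a multiplier (+100% means 2x)."""
    return 1 + percent_growth / 100


if __name__ == "__main__":
    print(growth_multiplier(2825))         # ChatGPT-User row from the table
    print(crawl_to_refer_ratio(38000, 1))  # lower bound quoted above
```

Read this way, +2,825 % means roughly 29 times the earlier request volume, and a 38,000:1 ratio means 38,000 pages fetched for every visitor referred back.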
5. Content Farms & Rewrites
Human or AI‑operated farms copy and rewrite your articles on unrestricted domains:
- Scrape the original article.
- Rewrite to avoid plagiarism detection.
- Publish on a domain with no robots.txt restrictions.
- AI crawlers index the rewrite.
- LLMs absorb the rewritten version.
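The rewrite step is what makes these copies hard to trace. As a rough illustration, the sketch below compares texts by overlapping word trigrams (Jaccard similarity), one simple stand-in for how duplicate detection can work. It is an assumption chosen for the example, not the method any particular detector uses, and the sample sentences are made up.

```python
def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 3) -> float:
    """Share of word n-grams that two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Made-up sentences: the second keeps the meaning but not the wording.
ORIGINAL = "robots txt is a sign not a fence for crawlers"
REWRITE = "a robots file works as a notice rather than a barrier for bots"

if __name__ == "__main__":
    print(jaccard(ORIGINAL, ORIGINAL))
    print(jaccard(ORIGINAL, REWRITE))
```

A verbatim copy scores 1.0, while the paraphrase shares no trigram with the original even though it says the same thing. The facts about your brand travel with the rewrite; the link back to your domain does not.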
In Bartz v. Anthropic PBC (2025), the court ruled that training AI on books was fair use, but that acquiring and keeping copies from “pirate sites” was not. How that reasoning applies to rewritten content remains an open question.
6. Bots Ignoring robots.txt
- 12.9 % of bots ignore robots.txt entirely, up from 3.3 % (Paul Calvano, Aug 2025).
- Duke University (2025): “Several categories of AI‑related crawlers never request robots.txt.”
- Kim & Bock (ACM IMC 2025): Scrapers are less likely to comply with more restrictive directives.
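You can look for the same pattern in your own access logs. The sketch below assumes the log has already been reduced to `(user_agent, path)` pairs. The sample entries and the function name are hypothetical.

```python
from collections import defaultdict


def agents_skipping_robots(requests: list[tuple[str, str]]) -> set[str]:
    """Return user agents that fetched pages but never requested /robots.txt."""
    paths_by_agent: dict[str, set[str]] = defaultdict(set)
    for user_agent, path in requests:
        paths_by_agent[user_agent].add(path)
    return {
        agent
        for agent, paths in paths_by_agent.items()
        if "/robots.txt" not in paths
    }


# Hypothetical, already-parsed log entries.
SAMPLE_LOG = [
    ("PoliteBot/1.0", "/robots.txt"),
    ("PoliteBot/1.0", "/blog/post"),
    ("QuietScraper/2.3", "/blog/post"),
    ("QuietScraper/2.3", "/pricing"),
]

if __name__ == "__main__":
    print(agents_skipping_robots(SAMPLE_LOG))
```

An agent in the result never read the file at all, so no directive in it could have influenced that agent. Agents that did request it may still ignore what it says, so treat this as a lower bound.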
Legal Perspective
In Ziff Davis v. OpenAI (2025), the judge described robots.txt as “more like a sign than a fence”—not a technological measure that “effectively controls access” under the DMCA.
7. Metrics Overview
| Metric | Value | Source |
|---|---|---|
| Bots ignoring robots.txt | 12.9 % | Paul Calvano, 2025 |
| Top 10K sites with AI bot rules | Only 14 % | Market analysis 2025 |
| Sites with any robots.txt | 94 % (12.2 M sites) | Global study 2025 |
8. Mitigation Strategies
- Defensive measures (e.g., stricter bot management) reduce direct crawling by 40–60 % for compliant bots, but they cannot affect historical data, syndicated copies, or content‑farm rewrites.
- Offensive approach: control the narrative rather than trying to hide.
At 498 Advance we built:
- GEOdoctor – technical auditing of brand visibility in LLMs.
- S.A.M. (Semantic Alignment Machine) – content alignment across owned media, UGC platforms (social GEO), and authority domains.
Full analysis with all academic sources:
Conclusion
Blocking everything with robots.txt and bot‑management tools is no longer sufficient. Historical archives, real‑time fetches, syndication, and content farms ensure that your website’s content can still appear in LLM outputs. The most effective strategy is to manage and align your narrative across all channels, rather than relying on a “sign” to keep bots out.
Have you encountered this paradox—blocking everything yet still appearing in LLM outputs? Feel free to share your observations.