Internet Increasingly Becoming Unarchivable

Published: February 14, 2026, 1:46 PM EST
9 min read

Source: Hacker News

Digital Archives, AI Crawlers, and News Publishers

Outlets such as The Guardian and The New York Times are scrutinising digital archives as potential back‑doors for AI crawlers.

As part of its mission to preserve the web, the Internet Archive operates crawlers that capture webpage snapshots. Many of these snapshots are accessible through its public‑facing tool, the Wayback Machine. But as AI bots scour the web for training data, the Archive’s commitment to free‑information access has turned its digital library into a potential liability for some news publishers.

The Guardian’s Response

  • Discovery: Access logs showed the Internet Archive was a frequent crawler of Guardian content.
  • Quote: “A lot of these AI businesses are looking for readily available, structured databases of content. The Internet Archive’s API would have been an obvious place to plug their own machines into and suck out the IP.” – Robert Hahn, head of business affairs and licensing.
  • Actions Taken:
    1. API Exclusion – Guardian content has been removed from the Internet Archive’s APIs.
    2. URL Filtering – Guardian article pages are filtered out of the Wayback Machine’s URL search interface.
    3. Retention of Non‑Article Pages – Regional homepages, topic pages, and other landing pages remain available.

“The decision was much more about compliance and a back‑door threat to our content,” Hahn added.

The Guardian stopped short of a total block because it still supports the nonprofit’s mission to democratise information, though the policy remains under review.

Financial Times (FT)

  • Blocking Policy: The FT blocks any bot that attempts to scrape its pay‑walled content, including bots from OpenAI, Anthropic, Perplexity, and the Internet Archive.
  • Result: Only FT stories outside the paywall appear in the Wayback Machine, since those are already publicly accessible.
  • Quote: “The majority of FT stories are pay‑walled,” says Matt Rogerson, director of global public policy and platform strategy.

Expert Commentary

“Common Crawl and Internet Archive are widely considered to be the ‘good guys’ and are used by ‘the bad guys’ like OpenAI. In everyone’s aversion to not be controlled by LLMs, I think the good guys are collateral damage.” – Michael Nelson, computer scientist and professor at Old Dominion University.

Other Publishers Taking Action

  • The New York Times – “Hard blocking” of the Internet Archive’s crawlers; added archive.org_bot to its robots.txt (as of end‑2025). Rationale: prevent unfettered AI access to Times content.
  • Reddit – Blocked the Internet Archive’s access to Reddit data. Rationale: protect users after AI companies scraped Wayback Machine data in violation of platform policies.
  • Other outlets – Ongoing reviews of bot‑management policies. Rationale: guard intellectual property and limit AI training‑data extraction.

NYT spokesperson: “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”

Reddit spokesperson (as quoted by The Verge): “The Internet Archive provides a service to the open web, but we’ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine. Until they’re able to defend their site and comply with platform policies…we’re limiting some of their access to Reddit data to protect redditors.”

Internet Archive’s Position

  • Founder’s View: Brewster Kahle warned that limiting libraries like the Internet Archive would reduce public access to the historical record and could undercut efforts to combat “information disorder”.
  • Technical Measures: In a Mastodon post last fall, Kahle noted that many collections are available to users but not for bulk downloading. The Archive employs:
    • Internal rate‑limiting systems
    • Filtering mechanisms
    • Network security controls

These steps aim to balance open access with the need to prevent large‑scale data harvesting for AI training.
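
For a concrete sense of what “rate‑limiting” can mean in practice, the sketch below implements a standard token bucket in Python. It is illustrative only, since the Archive has not published its implementation; the class, rates, and thresholds are assumptions.

    # A generic token-bucket rate limiter, sketched in Python. This is not the
    # Internet Archive's implementation; all names and numbers are assumptions.
    import time

    class TokenBucket:
        def __init__(self, rate: float, capacity: float):
            self.rate = rate          # tokens added per second
            self.capacity = capacity  # maximum burst size
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self) -> bool:
            """Spend one token if available; otherwise signal a throttle."""
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # the server would answer with HTTP 429, for example

    # e.g., allow a sustained 10 requests/second with bursts of up to 50 per client
    bucket = TokenBucket(rate=10, capacity=50)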

Summary

  • News publishers are increasingly restricting the Internet Archive’s crawlers to protect their copyrighted content from being harvested by AI companies.
  • The Guardian has applied targeted restrictions, the NYT has moved to hard blocking, the FT blocks any bot that scrapes its paywalled content, and Reddit is limiting the Archive’s access to its data.
  • The Internet Archive acknowledges the concerns and is implementing rate‑limiting and access controls, but its founder cautions that overly restrictive policies could harm the public’s ability to access historical web content.

Robots.txt and the Internet Archive

  • The Internet Archive’s robots.txt does not disallow any specific crawlers, including those of major AI companies.

  • As of January 12, the file for archive.org read:

    Welcome to the Archive! Please crawl our files. We appreciate it if you can crawl responsibly. Stay open!

    Shortly after this language was queried, it was changed to simply:

    Welcome to the Internet Archive!

Evidence of Wayback Machine Use in LLM Training

  • An analysis of Google’s C4 dataset (Washington Post, 2023) showed the Internet Archive was among the millions of websites used to train Google’s T5 model and Meta’s Llama models.
  • Out of 15 million domains in C4, web.archive.org ranked 187th in frequency.

AI‑Induced Outage (May 2023)

  • The Archive went offline after an AI company overloaded its servers.
  • According to Wayback Machine director Mark Graham (Nieman Lab, Fall 2023), the company sent tens of thousands of requests per second from AWS virtual hosts to extract text data.
  • The Archive blocked the hosts twice, then issued a public request to “respectfully” scrape the site.

“We got in contact with them. They ended up giving us a donation,” Graham said. “They ended up saying that they were sorry and they stopped doing it.”

  • Brewster Kahle wrote in a blog post shortly after the incident:

    “Those wanting to use our materials in bulk should start slowly, and ramp up. Also, if you are starting a large project please contact us … we are here to help.”

Investigating Publisher Robots.txt Files

  • The Guardian’s move to limit the Archive prompted a broader look at news publishers’ robots.txt files.
  • A robots.txt file acts like a “doorman,” indicating which parts of a site bots may crawl. While not legally binding, it signals where the Archive is unwelcome.

Example: Hard Blocking

  • The New York Times and The Athletic include archive.org_bot in their robots.txt files, though they do not currently block other Archive bots.
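
A hard block of this kind is expressed as a short robots.txt stanza. The snippet below illustrates the pattern and is not a verbatim copy of either publisher’s file:

    User-agent: archive.org_bot
    Disallow: /

The Disallow: / line bars the named user agent from every path on the site; as noted above, compliance with robots.txt is voluntary rather than legally enforced.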

Data Source

  • Nieman Lab used journalist Ben Welsh’s database of 1,167 news websites as a starting point.
  • Welsh regularly scrapes the robots.txt files of these outlets.
  • In late December, a spreadsheet from Welsh’s site listed all bots disallowed by each site.
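
As a rough sketch of how an audit like this can be automated (this is not Welsh’s actual pipeline; the function name, bot list, and example domain are illustrative assumptions), Python’s standard urllib.robotparser module can fetch a site’s live robots.txt and test whether a given user agent is barred:

    # Minimal sketch of a robots.txt audit; requires Python 3.9+.
    # Not Ben Welsh's actual pipeline; all names here are illustrative.
    from urllib import robotparser

    # Two Archive-associated user agents discussed in this article.
    ARCHIVE_BOTS = ["archive.org_bot", "ia_archiver-web.archive.org"]

    def disallowed_archive_bots(domain: str) -> list[str]:
        """Return which Archive bots a site's robots.txt bars from the homepage."""
        parser = robotparser.RobotFileParser()
        parser.set_url(f"https://{domain}/robots.txt")
        parser.read()  # fetch and parse the live file
        root = f"https://{domain}/"
        return [bot for bot in ARCHIVE_BOTS if not parser.can_fetch(bot, root)]

    if __name__ == "__main__":
        # Hypothetical spot check against a single domain.
        print(disallowed_archive_bots("www.example-news-site.com"))

Looping such a check over every domain in Welsh’s list would yield a table like the spreadsheet described above.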

Identified Archive‑Related Bots

Four bots associated with the Internet Archive (via the AI user‑agent watchdog Dark Visitors) were examined. (The Archive did not respond to requests to confirm ownership of these bots.)

Note: This data is exploratory, not comprehensive. It reflects a U.S.-centric sample (≈ 76 % of sites are U.S.-based).

Findings

  • 241 news sites across nine countries explicitly disallow at least one of the four Archive bots.

  • 87 % of those sites are owned by USA Today Co. (formerly Gannett). Gannett sites constitute only 18 % of Welsh’s original list.

  • Every Gannett‑owned outlet disallows the same two bots:

    1. archive.org_bot
    2. ia_archiver-web.archive.org

    These entries were added to Gannett robots.txt files in 2025.

  • Some Gannett sites employ stronger measures. For example, a search for the Des Moines Register in the Wayback Machine returns:

    “Sorry. This URL has been excluded from the Wayback Machine.”

Gannett’s Public Statements

  • A company spokesperson said via email:

    “USA Today Co. has consistently emphasized the importance of safeguarding our content and intellectual property. Last year, we introduced new protocols to deter unauthorized data collection and scraping, redirecting such activity to a designated page outlining our licensing requirements.”

  • Gannett declined further comment on its relationship with the Internet Archive.

  • In an October 2025 earnings call, CEO Mike Reed discussed anti‑scraping measures:

    “In September alone, we blocked 75 million AI bots across our local and USA Today platforms, the vast majority of which were seeking to scrape our local content. About 70 million of those came from OpenAI.”

  • Gannett signed a content‑licensing agreement with Perplexity in July 2025.

Internet Archive Bot Blocking by News Sites

“The Internet Archive tends to be good citizens. It’s the law of unintended consequences: you do something for really good purposes, and it gets abused.” – Robert Hahn

Key Findings

  • 93 % (226 sites) of publishers in our dataset disallow two out of the four Internet Archive bots we identified.
  • Three news sites disallow three Internet Archive crawlers: Le Huffington Post, Le Monde, and Le Monde in English (all owned by Groupe Le Monde).

Broader Blocking Patterns

  • Out of the 241 sites that block at least one Internet Archive bot, 240 also block Common Crawl – another nonprofit internet‑preservation project that has been more closely linked to commercial LLM development (see Wired).
  • 231 sites block bots operated by OpenAI, Google AI, and Common Crawl.

Context

  • As previously reported, the Internet Archive has taken on the Herculean task of preserving the web, while many news organizations lack the resources to archive their own work.

  • In December, Poynter announced a joint initiative with the Internet Archive to train local news outlets on preserving their content.

  • Archiving initiatives like this are few and far between; there is no federal mandate requiring internet content preservation, making the Internet Archive the most robust U.S. archiving effort.

Photo Credit

  • Internet Archive homepage – photo by SDF_QWE via Adobe Stock.

About the Author

Andrew Deck – staff writer covering AI at Nieman Lab.

References

  1. Gannett & USA TODAY Network – Strategic AI Content Licensing Agreement
  2. Wired – “The Fight Against AI Comes to a Foundational Data Set”
  3. Nieman Lab – “The Wayback Machine’s Snapshots of News Homepages Plummet”

(All links are active as of February 2026.)
