News publishers limit Internet Archive access due to AI scraping concerns
Source: Hacker News
Outlets like The Guardian and The New York Times are scrutinizing digital archives as potential backdoors for AI crawlers.
As part of its mission to preserve the web, the Internet Archive operates crawlers that capture webpage snapshots. Many of these snapshots are accessible through its public‑facing tool, the Wayback Machine. But as AI bots scavenge the web for training data to feed their models, the Internet Archive’s commitment to free information access has turned its digital library into a potential liability for some news publishers.
The Guardian’s Response
When The Guardian examined who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, Robert Hahn, head of business affairs and licensing, said on LinkedIn. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.
Actions taken
- Exclude The Guardian from the Internet Archive’s APIs.
- Filter out article pages from the Wayback Machine’s URLs interface.
- Continue to allow regional homepages, topic pages, and other landing pages in the Wayback Machine.
“A lot of these AI businesses are looking for readily available, structured databases of content,” Hahn said. “The Internet Archive’s API would have been an obvious place to plug their own machines into and suck out the IP.”
He added that the Wayback Machine itself is “less risky,” since its data is not as well structured.
The Guardian has not documented specific instances of its webpages being scraped by AI companies via the Wayback Machine. Instead, it is acting proactively and working directly with the Internet Archive to implement the changes. Hahn says the organization has been receptive to The Guardian’s concerns.
“[The decision] was much more about compliance and a backdoor threat to our content,” Hahn explained.
The outlet stopped short of an all‑out block on the Internet Archive’s crawlers because it supports the nonprofit’s mission to democratize information, though that position remains under review as part of its routine bot management.
“If publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record,” said Internet Archive founder Brewster Kahle, warning that such limits could undercut the organization’s work countering “information disorder.”
Financial Times
The Financial Times blocks any bot that tries to scrape its paywalled content, including bots from OpenAI, Anthropic, Perplexity, and the Internet Archive. According to Matt Rogerson, the FT’s director of global public policy and platform strategy, the majority of FT stories sit behind the paywall, so generally only the unpaywalled stories, which are meant to be publicly available, appear in the Wayback Machine.
“Common Crawl and Internet Archive are widely considered to be the ‘good guys’ and are used by ‘the bad guys’ like OpenAI,” said Michael Nelson, a computer scientist and professor at Old Dominion University. “In everyone’s aversion to not be controlled by LLMs, I think the good guys are collateral damage.”
The New York Times
The New York Times confirmed to Nieman Lab that it is actively “hard blocking” the Internet Archive’s crawlers. At the end of 2025, the Times added the crawler archive.org_bot to its robots.txt file, disallowing access to its content.
“We believe in the value of The New York Times’s human‑led journalism and always want to ensure that our IP is being accessed and used lawfully,” said a Times spokesperson. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”
Reddit’s Stance
In August, Reddit announced that it would block the Internet Archive, whose digital libraries include countless archived Reddit forums, comment sections, and profiles. This content is similar to what Reddit now licenses to Google as AI training data for tens of millions of dollars.
“[The] Internet Archive provides a service to the open web, but we’ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine,” a Reddit spokesperson told The Verge. “Until they’re able to defend their site and comply with platform policies…we’re limiting some of their access to Reddit data to protect redditors.”
Internet Archive’s Counter‑Measures
Kahle has alluded to steps the Internet Archive is taking to restrict bulk access to its libraries. In a Mastodon post last fall, he wrote:
“There are many collections that are available to users but not for bulk downloading. We use internal rate‑limiting systems, filtering mechanisms, and network secu…”
(The statement was truncated in the original source.)
Summary
News publishers are increasingly viewing the Internet Archive’s Wayback Machine as a potential backdoor for AI training data. While some, like The Guardian and The New York Times, are imposing selective blocks or exclusions, others, such as Reddit, are moving toward broader restrictions. The Internet Archive, for its part, is exploring technical safeguards to balance open access with the growing concerns of content owners.
Internet Archive, AI Crawlers, and News Publishers
The Internet Archive’s robots.txt file does not currently disallow any specific crawlers, including those of major AI companies. As of January 12, the file for archive.org read:
“Welcome to the Archive! Please crawl our files. We appreciate it if you can crawl responsibly. Stay open!”
Shortly after we inquired about this language, it was changed to simply: “Welcome to the Internet Archive!”
Evidence of Wayback Machine Use in LLM Training
- An analysis of Google’s C4 dataset by The Washington Post (2023) showed that the Internet Archive was among the millions of websites used to train Google’s T5 model and Meta’s Llama models.
- Out of 15 million domains in the C4 dataset, web.archive.org ranked 187th in frequency.
- In May 2023, the Internet Archive went offline temporarily after an AI company caused a server overload. Wayback Machine director Mark Graham told Nieman Lab that the company sent “tens of thousands of requests per second from virtual hosts on Amazon Web Services to extract text data from the nonprofit’s public-domain archives.” The Archive blocked the hosts twice before issuing a public request to “respectfully” scrape its site.
“We got in contact with them. They ended up giving us a donation,” Graham said. “They ended up saying that they were sorry and they stopped doing it.”
“Those wanting to use our materials in bulk should start slowly, and ramp up,” Brewster Kahle wrote in a blog post shortly after the incident. “Also, if you are starting a large project please contact us … we are here to help.”
Publishers’ Robots.txt Policies
The Guardian’s moves to limit the Internet Archive’s access prompted us to examine whether other news publishers were taking similar actions. A website’s robots.txt file tells bots which parts of the site they may crawl, acting like a “doorman.” While not legally binding, it signals where the Archive is unwelcome.
- The New York Times and The Athletic disallow archive.org_bot in their robots.txt files, though they do not currently disallow other Archive bots.
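For illustration, a publisher that wants to turn away the Archive’s crawler can add an entry like the one below. This is a hypothetical sketch, not a reproduction of any outlet’s actual file; real robots.txt files typically carry many more rules.

```
# Hypothetical robots.txt entry: turn away the Internet Archive's crawler
# site-wide, while leaving default access for everyone else.
User-agent: archive.org_bot
Disallow: /

# All other crawlers fall through to this default block (nothing disallowed).
User-agent: *
Disallow:
```

Compliance is voluntary: robots.txt is a convention rather than an enforcement mechanism, so a crawler that ignores it faces no technical barrier unless the publisher also blocks it at the network level, the kind of “hard blocking” the Times describes.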
Methodology
Nieman Lab used journalist Ben Welsh’s database of 1,167 news websites as a starting point. Welsh regularly scrapes the robots.txt files of these outlets. In late December, we downloaded a spreadsheet from his site that listed all bots disallowed in the robots.txt files of those sites.
We identified four bots that the AI‑user‑agent watchdog service Dark Visitors has associated with the Internet Archive (the Archive did not confirm ownership of these bots).
This data is exploratory, not comprehensive. It does not represent global, industry‑wide trends—76 % of sites in Welsh’s list are U.S.-based—but it begins to shed light on which publishers are less eager to have their content crawled by the Internet Archive.
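As a rough illustration of that kind of check (not Nieman Lab’s actual pipeline, which relied on Welsh’s pre-scraped spreadsheet), a site’s robots.txt can be queried directly with Python’s standard library. The two user agents below are the ones named in this article; the remaining Archive-associated agents from the Dark Visitors list are not reproduced here.

```python
# Minimal sketch: ask a site's robots.txt whether Archive-associated crawlers
# may fetch the homepage. The bot names are the two cited in this article; the
# domain is a placeholder standing in for entries from Welsh's news-site list.
from urllib import robotparser

ARCHIVE_BOTS = ["archive.org_bot", "ia_archiver-web.archive.org"]

def disallowed_archive_bots(domain: str) -> list[str]:
    """Return the Archive-associated user agents blocked from the site root."""
    parser = robotparser.RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    parser.read()  # download and parse the file
    root = f"https://{domain}/"
    return [bot for bot in ARCHIVE_BOTS if not parser.can_fetch(bot, root)]

if __name__ == "__main__":
    for domain in ["example.com"]:  # substitute real publisher domains here
        print(domain, disallowed_archive_bots(domain))
```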
Findings
- 241 news sites from nine countries explicitly disallow at least one of the four Internet Archive crawling bots.
- 87 % of those sites are owned by USA Today Co. (formerly Gannett), which accounts for only 18 % of Welsh’s original publisher list.
- Every Gannett-owned outlet in our dataset disallows the same two bots: archive.org_bot and ia_archiver-web.archive.org. These bots were added to the robots.txt files of Gannett publications in 2025.
- Some Gannett sites have taken stronger measures. A URL search for the Des Moines Register in the Wayback Machine returns the message: “Sorry. This URL has been excluded from the Wayback Machine.”
“USA Today Co. has consistently emphasized the importance of safeguarding our content and intellectual property,” a company spokesperson said via email. “Last year, we introduced new protocols to deter unauthorized data collection and scraping, redirecting such activity to a designated page outlining our licensing requirements.”
Gannett declined further comment on its relationship with the Internet Archive. In an October 2025 earnings call, CEO Mike Reed discussed the company’s anti‑scraping measures:
“In September alone, we blocked 75 million AI bots across our local and USA Today platforms, the vast majority of which were seeking to scrape our local content,” Reed said. “About 70 million of those came from OpenAI.”
Gannett signed a content-licensing agreement with Perplexity in July 2025 (see the press release cited at the end of this article).
Key Findings
- 93 % (226 sites) of publishers in our dataset block two of the four Internet Archive bots we identified.
- Three news sites, all owned by Groupe Le Monde, block three Internet Archive crawlers:
- Le Huffington Post
- Le Monde (French)
- Le Monde (English)
- Broader blocking behavior:
- Of the 241 sites that block at least one Internet Archive bot, 240 also block Common Crawl, another nonprofit preservation project that has been more closely linked to commercial LLM development (as reported by Wired).
- 231 sites block bots operated by OpenAI, Google AI, and Common Crawl.
Context
- As we’ve previously reported, the Internet Archive has taken on the Herculean task of preserving the web, while many news organizations lack the resources to archive their own work.
- In December 2025, Poynter announced a joint initiative with the Internet Archive to train local newsrooms on digital preservation (Poynter announcement).
- Archiving projects like this are few and far between; without a federal mandate, the Internet Archive remains the most robust archiving effort in the United States.
About the Author
Andrew Deck – staff writer covering AI at Nieman Lab.
- Tips about AI usage in your newsroom? Reach out:
- Email: andrewdeck@niemanlab.org
- Bluesky: andrewdeck.bsky.social
- Signal: +1 203‑841‑6241
Source: “Network and Perplexity Announce Strategic AI Content Licensing Agreement” (link truncated for brevity).