Microsoft deletes blog telling users to train AI on pirated Harry Potter books

Published: February 20, 2026 at 07:11 AM EST

Source: Ars Technica

Microsoft’s Harry Potter Dataset Controversy

Summary – After a backlash on Hacker News, Microsoft removed a blog post that appeared to encourage developers to train generative‑AI models on a Harry Potter dataset that had been mistakenly labeled “public domain.” The post linked to a Kaggle collection of all seven books, which was later deleted.


Background

  • Blog post: LangChain with SQLVectorStore example (Nov 2024) – authored by senior product manager Pooja Kamath.
  • Purpose: Promote a new Azure SQL DB feature that lets developers add generative‑AI capabilities with a few lines of code (Azure SQL DB + LangChain + LLMs).
  • Claim: Using a “well‑known dataset” such as the Harry Potter books would provide “engaging and relatable examples” for customers.

“The books are one of the most famous and cherished series in literary history… fans could build Q&A systems or generate new AI‑driven Harry Potter fan fiction.” – Microsoft blog


The Dataset Issue

  • The blog linked to a Kaggle dataset that contained all seven Harry Potter novels.
  • The dataset was incorrectly marked “public domain.”
  • Kaggle’s terms state that rights‑holders can submit infringement notices and that repeat offenders may be suspended.

How the mistake was discovered

  • Hacker News users highlighted the problem in a discussion thread.
  • Ars Technica investigated and found that the dataset had been downloaded only about 10,000 times, likely escaping the author’s notice.

Response from the uploader

“The dataset was marked as Public Domain by mistake. There was no intention to misrepresent the licensing status of the works.” – Shubham Maindola, data scientist in India (the uploader).


Cathay Y. N. Smith, law professor and co‑director of Chicago‑Kent College of Law’s Program in Intellectual Property Law, explained:

“Someone might be really knowledgeable about books and technology, but not necessarily about copyright terms and how long they last… especially if she saw that something was marked by another reputable company as being public domain.”


Aftermath

  • Microsoft: Declined to comment on the incident.
  • Kaggle: Did not respond to requests for comment.
  • Dataset: Removed from Kaggle after Ars Technica contacted the uploader.

Key Takeaways

  1. Verify licensing – Even if a dataset is labeled “public domain,” double‑check the copyright status, especially for recent works.
  2. Corporate responsibility – Companies promoting AI tools should ensure that example data complies with intellectual‑property law.
  3. Community vigilance – Public scrutiny (e.g., on Hacker News) can surface compliance issues that might otherwise go unnoticed.
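
The first takeaway can be sketched as a small sanity check on a dataset's license label. This is a minimal illustration, not legal guidance: it assumes a simplified US rule of thumb (works published more than 95 calendar years ago are treated as public domain), and the names `DatasetEntry` and `needs_license_review` are hypothetical helpers invented for this example.

```python
from dataclasses import dataclass

# Assumption: simplified US rule of thumb. A work published in year P is
# treated as public domain from January 1 of year P + 96 onward.
US_TERM_YEARS = 95


@dataclass
class DatasetEntry:
    title: str
    first_published: int  # year of first publication
    claimed_license: str  # license label shown on the hosting page


def needs_license_review(entry: DatasetEntry, current_year: int) -> bool:
    """Return True when a 'public domain' label looks implausible."""
    claims_public_domain = entry.claimed_license.strip().lower() == "public domain"
    plausibly_expired = entry.first_published + US_TERM_YEARS < current_year
    return claims_public_domain and not plausibly_expired


# The first Harry Potter book was first published in the UK in 1997.
potter = DatasetEntry("Harry Potter and the Sorcerer's Stone", 1997, "Public Domain")
austen = DatasetEntry("Pride and Prejudice", 1813, "Public Domain")

print(needs_license_review(potter, 2026))  # label conflicts with the publication date
print(needs_license_review(austen, 2026))  # label is plausible
```

A check like this only flags obvious mismatches; anything it flags still has to be resolved against the actual rights holder and jurisdiction.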

An archived version of the original Microsoft blog post is available as a Wayback Machine snapshot.

Microsoft Was “Probably Smart” to Pull the Blog

On Hacker News, commenters argued that anyone familiar with the popular franchise would know the Harry Potter books are not in the public domain. They debated whether Microsoft’s blog was “problematic copyright‑wise,” since Microsoft not only encouraged customers to download the infringing material but also used the books themselves to create Harry Potter AI models that relied on beloved characters to hype Microsoft products.

Microsoft’s blog was posted more than a year ago, at a time when AI firms began facing lawsuits over models that allegedly infringed copyrights by training on pirated material and regurgitating works verbatim.

The post recommended that users learn to train their own AI models by:

  1. Downloading the Harry Potter dataset.
  2. Uploading the text files to Azure Blob Storage.

It included example models based on a dataset that Microsoft seemingly uploaded to Azure Blob Storage, which contained only the first book, Harry Potter and the Sorcerer’s Stone.


Example Use Cases Described in the Blog

  • Q&A systems – By training large language models (LLMs) on the text files, fans could create Q&A bots that pull up relevant excerpts.

    • Query: “Wizarding World snacks” → Returns a passage from The Sorcerer’s Stone where Harry marvels at Bertie Bott’s Every‑Flavor Beans and chocolate frogs.
    • Prompt: “How did Harry feel when he first learned that he was a wizard?” → Generates an answer pointing to early excerpts in the book.
  • Fan‑fiction generation – Kamath suggested using the model to “explore new adventures” and even “create alternate endings.” The model could quickly locate “contextually similar” excerpts and combine them with new text to produce fresh stories that fit the existing narrative.

    As an illustration, Kamath trained a model to write a story she could use to market the feature she was blogging about. She asked the model to write a tale in which Harry meets a new friend on the Hogwarts Express who explains Microsoft’s Native Vector Support in SQL “in the Muggle world.”

    The generated fan‑fiction stitched together passages from The Sorcerer’s Stone (e.g., Harry learning about Quidditch and meeting Hermione) with a sales pitch that likened the new feature to a spell that instantly finds exactly what you need among thousands of options—perfect for machine‑learning, AI, and recommendation systems.

  • Image generation – Kamath also produced an image showing Harry with his new friend, stamped with a Microsoft logo, further blurring the line between the two brands.
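
The retrieval step behind the Q&A use case, finding excerpts "contextually similar" to a query, can be sketched roughly as follows. This is a toy stand-in, not the blog's implementation: the post reportedly used Azure SQL DB, LangChain, and LLMs, whereas this sketch assumes plain word-count vectors and cosine similarity instead of learned embeddings. The passages are placeholders written for this example, and `top_excerpts` is a hypothetical helper.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_excerpts(query: str, excerpts: list[str], k: int = 1) -> list[str]:
    """Return the k excerpts most similar to the query."""
    query_vec = tokenize(query)
    ranked = sorted(
        excerpts,
        key=lambda text: cosine_similarity(query_vec, tokenize(text)),
        reverse=True,
    )
    return ranked[:k]


# Placeholder passages written for this sketch (not taken from any book).
EXCERPTS = [
    "The travellers shared sweet snacks and candy on the long train ride.",
    "The match began at noon and the crowd cheered for both teams.",
    "She opened the letter and learned that the school had accepted her.",
]

print(top_excerpts("snacks on the train", EXCERPTS))
```

A production system would replace `tokenize` with an embedding model and the sorted scan with a vector index, but the ranking idea is the same: score every stored excerpt against the query and return the closest matches.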


“I think that the regurgitation and the creation of fan fiction could both flag copyright issues. Fan fiction often has to take from expressive elements—a copyrighted character that’s famous enough to be protected by copyright law or plot sequences. If these things are copied and reproduced, then that output could be potentially infringing.” – Smith, speaking to Ars Technica

Smith added that the situation remains a gray area:

“Looking at the blog, I would be concerned, but I wouldn’t say it’s automatically infringement.”

Smith also noted that Microsoft’s decision to pull the blog was likely prudent:

“Courts have generally said that training AI on copyrighted books is fair use, but they continue to probe questions about pirated AI training materials.” – Smith, speaking to Ars Technica


Origin of the Dataset

On the now‑deleted Kaggle dataset page, Maindola explained how the data were sourced:

“I downloaded the e‑books and then converted them to .txt files.”



Microsoft May Have Infringed Copyrights



Key Points

  • Potential liability – If Microsoft knowingly used pirated books to train its models, “fair use could be a difficult argument,” says copyright expert Smith.
  • Fair‑use arguments – Some Hacker News commenters argue the blog was for “educational purposes,” giving Microsoft “good arguments” in its defense.
  • Secondary liability – Smith warns that Microsoft could be liable for “contributory copyright infringement” for pointing users to a Kaggle dataset that was downloaded more than 10,000 times before it was removed.

“The ultimate result is to create something infringing by saying, ‘Hey, here you go, go grab that infringing stuff and use that in our system,’” – Smith


Reactions on Hacker News

  • Former Microsoft employee – Criticized the lack of editorial approval: “It looks like somebody made a bad judgment call … and it was taken down as soon as someone noticed.”
  • Other users – Blamed the Kaggle uploader Maindola for labeling the dataset “public domain.” Others counter‑argued that Microsoft’s staff should have known the works were copyrighted.
  • On the dataset – The blog also linked to an Azure sample containing Isaac Asimov’s Foundation series, another non‑public‑domain work.
  • Fair‑use defenders – Suggested the blog could be considered fair use, especially for nonprofit or educational contexts.

“Microsoft could have used any dataset for their blog, even actual public‑domain novels,” one commenter wrote.



Author Bio


Ashley Belanger – Senior policy reporter for Ars Technica. She tracks the social impacts of emerging policies and new technologies. Based in Chicago, she has 20 years of journalism experience.



