Nvidia’s ISP piracy defense backfires as judge refuses to dismiss copyright lawsuit over more than 197,000 pirated books — scripts in NeMo Framework allegedly ‘have no other purpose’ than to speed up infringement
Source: Tom’s Hardware

U.S. District Judge Jon Tigar denied Nvidia’s request to dismiss a copyright infringement lawsuit. The case alleges that Nvidia’s AI‑powered NeMo Megatron Framework was used to facilitate the illegal downloading and preprocessing of copyrighted eBooks.
The lawsuit
- Datasets involved:
  - Bibliotik – a private eBook torrent tracker containing over 197,000 books.
  - Books3 – a dataset that incorporated Bibliotik data.
  - The Pile – an 800 GB collection that included Books3 and was used to train Nvidia’s large language models (LLMs).
- Allegations: Specific scripts within the NeMo Megatron Framework were allegedly designed solely to speed up the acquisition and processing of the copyrighted material, and so have “no other purpose” than to facilitate infringement.
- Judge Tigar’s reasoning: The court distinguished Nvidia’s situation from cases like Sony and Cox, noting that the scripts themselves, not the broader framework, were allegedly intended for infringing use.
Nvidia’s defense
Nvidia argued that the NeMo Megatron Framework has legitimate, non‑infringing uses and cited the Supreme Court’s Cox v. Sony decision, which held that service providers are not automatically liable for users’ piracy. The company claimed that, under that precedent, merely providing a service to the public does not constitute copyright infringement.
Related AI copyright cases
- Meta: Facing a lawsuit alleging the use of pirated material for training its models. Meta has argued that downloading such material is not unlawful as long as the files are not “seeded” (uploaded to other torrent users).
- Google: Advocating for AI‑scraping to be treated as fair use, emphasizing the need for “copyright systems that enable appropriate and fair use” while allowing opt‑outs for data owners.
These cases illustrate the broader legal debate over whether AI developers can rely on existing copyright doctrines when training models on large, often uncurated datasets.