AIs can generate near-verbatim copies of novels from training data
Source: Hacker News
Legal developments
- A U.S. court last year found that Anthropic’s training of large language models on copyrighted works was fair use because it deemed the use “transformative.” However, the court determined that storing pirated works was “inherently, irredeemably infringing,” prompting the AI group to pay $1.5 billion to settle the lawsuit.
- In Germany, a ruling last November concluded that OpenAI had infringed copyright because its model had memorized song lyrics. The case, brought by GEMA (an association representing composers, lyricists, and publishers), was hailed as a landmark decision in the EU.
Expert commentary
- Rudy Telscher, a partner at law firm Husch Blackwell, said reproducing an entire book without jailbreaking is “clearly a copyright violation.” The key question, he added, is whether this happens often enough for AI companies to be held vicariously liable for the infringement.
- Anthropic argued that the jailbreaking technique used in the Stanford and Yale research is impractical for ordinary users and would require more effort to extract text than simply purchasing the content. The company also emphasized that its model does not store copies of specific datasets but learns from patterns and relationships between words and strings in its training data.
- The fact that AI labs have implemented safeguards to prevent training data from being extracted indicates they are aware of the problem, according to Yves-Alexandre de Montjoye of Imperial College London.
- Ben Zhao, computer-science professor at the University of Chicago, questioned whether AI labs need to use copyrighted material to build cutting-edge models. “Whether the technical result can be done or not, it’s still a question of should we be doing this?” Zhao said. “The legal side should eventually hold their ground and really be the arbiter in this whole process.”
Company responses
- xAI, OpenAI, and Google did not respond to requests for comment.