Meta's AI memorized books verbatim, which could cost the company billions

In April, authors and publishers protested against Meta's use of copyrighted books to train AI
Vuk Valcic / Alamy Live News
Billions of dollars are at stake as courts in the United States and the United Kingdom decide whether tech companies can legally train their artificial intelligence models on copyrighted books. Authors and publishers have brought several lawsuits over this issue, and in a new twist, researchers have shown that at least one AI model has not only used popular books in its training data, but also memorized their content verbatim.
Many of the ongoing disputes revolve around whether AI developers have the legal right to use copyrighted works without asking permission first. Previous research has revealed that many of the large language models (LLMs) behind popular AI chatbots and other generative AI programs were trained on the “Books3” dataset, which contains nearly 200,000 copyrighted books, including many pirated ones. The AI developers who trained their models on this material have argued that they did not break the law because an LLM outputs new combinations of words based on its training, transforming rather than reproducing the copyrighted work.
But now researchers have tested several models to see how much of this training data they can reproduce word for word. They found that many models do not retain the exact text of the books in their training data, but one of Meta's models has memorized almost all of certain books. If judges rule against the company, the researchers estimate that this could make Meta liable for at least $1 billion in damages.
“This means, on the one hand, that AI models are not just ‘plagiarism machines’, as some have alleged, but it also means that they do more than learn general relationships between words,” says Mark Lemley at Stanford University in California. “And the fact that the answer differs from model to model and from book to book means that it is very difficult to set a clear legal rule that will work in all cases.”
Lemley previously defended Meta in a generative AI copyright case called Kadrey v Meta Platforms. Authors whose books had been used to train Meta's AI models sued the tech giant for copyright infringement. The case is still being heard in the Northern District of California.
In January 2025, Lemley announced that he had dropped Meta as a client, although he said he still believed the company should win the case. Emil Vazquez, a spokesperson for Meta, says that “fair use of copyrighted materials is vital” to developing the company's AI models. “We disagree with the plaintiffs' assertions, and the full record tells a different story,” he says.
In this latest research, Lemley and his colleagues tested AI memorization of books by splitting short book excerpts into two parts, a prefix and a suffix, and seeing whether a model prompted with the prefix would respond with the suffix. For example, they split a quote from F. Scott Fitzgerald's The Great Gatsby into the prefix “They were careless people, Tom and Daisy – they smashed up things and creatures and then retreated” and the suffix “back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made.”
Based on their results, the researchers estimated the probability that each AI model would complete the excerpts verbatim. They then compared these probabilities with the odds of the models doing so by random chance.
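As a rough illustration of this kind of test, the Python sketch below treats the probability of a verbatim completion as the product of the probabilities a model assigns to each suffix token in turn, and compares it with a uniform-guessing baseline. The per-token probabilities, the vocabulary size and the uniform baseline are all assumptions made for illustration; they are not figures or methods taken from the study.

```python
import math

def suffix_log_prob(token_probs):
    """Log-probability of emitting the whole suffix verbatim.

    token_probs[i] is a hypothetical probability that the model assigns to
    the i-th suffix token, given the prefix and all earlier suffix tokens.
    """
    return sum(math.log(p) for p in token_probs)

def chance_log_prob(num_tokens, vocab_size):
    """Assumed baseline: every token drawn uniformly from the vocabulary."""
    return -num_tokens * math.log(vocab_size)

# Illustrative numbers only, not measurements from the research.
memorized = [0.9] * 50       # model is confident about each suffix token
unmemorized = [0.01] * 50    # model is mostly guessing
VOCAB_SIZE = 50_000          # assumed vocabulary size

for name, probs in [("memorized", memorized), ("unmemorized", unmemorized)]:
    lp = suffix_log_prob(probs)
    baseline = chance_log_prob(len(probs), VOCAB_SIZE)
    print(f"{name}: P(suffix | prefix) = {math.exp(lp):.3g}, "
          f"log-ratio vs chance = {lp - baseline:.1f}")
```

Working in log space avoids numerical underflow when many small per-token probabilities are multiplied together.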
The excerpts included pieces of text from 36 copyrighted books, including popular titles such as George R. R. Martin's A Game of Thrones and Sheryl Sandberg's Lean In. The researchers also tested excerpts from books written by plaintiffs in the Kadrey v Meta Platforms case.
The researchers ran these experiments on 13 open-source models, including models developed and released by Meta, Google, DeepSeek, EleutherAI and Microsoft. Most of the companies besides Meta did not respond to requests for comment, and Microsoft declined to comment.
The tests revealed that Meta's Llama 3.1 70B model has memorized most of the first book in J. K. Rowling's Harry Potter series, as well as The Great Gatsby and George Orwell's dystopian novel 1984. Most of the other models had memorized very little of the books, including the sample books written by the lawsuit plaintiffs. Meta declined to comment on these results.
The researchers estimate that an AI model found to have infringed the copyright of just 3 per cent of the Books3 dataset could face a statutory damages award of almost $1 billion, and possibly even larger awards based on the AI developers' profits linked to that infringement.
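The article does not say how the researchers arrived at that figure, but one back-of-the-envelope route to a number in this range is sketched below in Python. It assumes the dataset holds roughly 200,000 books and applies $150,000 per work, the ceiling for statutory damages for wilful infringement under US copyright law; the per-work amount is an assumption for illustration, since a court could award far less.

```python
BOOKS3_SIZE = 200_000              # "nearly 200,000" books, as reported above
INFRINGED_SHARE = 0.03             # 3 per cent of the dataset
MAX_STATUTORY_PER_WORK = 150_000   # assumption: US ceiling for wilful infringement

def statutory_exposure(num_works, share, per_work):
    """Upper-bound exposure in dollars if `share` of the works each drew `per_work`."""
    return round(num_works * share) * per_work

total = statutory_exposure(BOOKS3_SIZE, INFRINGED_SHARE, MAX_STATUTORY_PER_WORK)
print(f"${total:,}")
```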
This technique could be a “good forensic tool” for identifying the extent of AI memorization, says Randy McCarthy at the law firm Hall Estill in Oklahoma. But it does not resolve whether companies can legally train their AI models on copyrighted works under the US “fair use” rule, a legal doctrine that permits the unlicensed use of copyrighted works in certain circumstances.
McCarthy notes that AI companies generally acknowledge training their models on copyrighted material. “The question is: did they have the right to do so?” he says.
In the United Kingdom, on the other hand, the memorization finding could be “very significant from a copyright perspective”, says Robert Lands at the law firm Howard Kennedy in London. UK copyright law follows the concept of “fair dealing”, which provides a much narrower exception to copyright infringement than the US fair use doctrine. AI models that have memorized pirated books are therefore unlikely to qualify for that exception, he says.
Topics:
- artificial intelligence
- law



