New Study Suggests OpenAI Models ‘Memorized’ Copyrighted Content

A new study suggests that OpenAI’s models may have “memorized” portions of copyrighted content, lending weight to ongoing allegations from authors, programmers, and other rights holders. These groups accuse OpenAI of using their works, including books, codebases, and articles, to train its AI models without permission.

OpenAI has long defended itself by invoking the fair use doctrine, arguing that training on publicly available data is permitted under it. However, critics argue that U.S. copyright law contains no exception for training data, a disagreement that has sparked legal battles over the practice.

Method of Identifying “Memorized” Content

The study, co-authored by researchers from the University of Washington, the University of Copenhagen, and Stanford University, proposes a novel method for identifying training data “memorized” by AI models like OpenAI’s. The method centers on “high-surprisal” words, meaning words that are statistically unlikely in a given context. For example, in the sentence “Jack and I sat perfectly still with the radar humming,” the word “radar” is high-surprisal because it is far less likely to appear in that context than words like “engine” or “radio.”

By removing high-surprisal words from excerpts of fiction books and New York Times articles, then asking the models to predict the masked words, the researchers found that OpenAI’s GPT-4 showed signs of having memorized passages from popular books. In particular, GPT-4 appeared to have memorized portions of books in BookMIA, a dataset of copyrighted e-books, while showing a lower rate of memorization for New York Times articles.
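To make the probe concrete, here is a minimal sketch of the masking-and-prediction idea in Python. The query_model callable, the prompt wording, and the exact-match scoring are illustrative assumptions rather than the researchers’ actual pipeline; the excerpt is the example sentence quoted above, not data from the study.

```python
# Minimal sketch of a fill-in-the-blank "memorization" probe.
# Assumes a hypothetical query_model(prompt) -> str callable wrapping whatever
# language model is being tested; swap in a real client to run the probe.

def mask_word(passage: str, target: str, blank: str = "[MASK]") -> str:
    """Replace the first occurrence of the high-surprisal word with a blank token."""
    return passage.replace(target, blank, 1)


def probe_memorization(passage: str, target: str, query_model) -> bool:
    """Ask the model to guess the masked word; an exact match hints at memorization."""
    prompt = (
        "One word in the excerpt below has been replaced with [MASK]. "
        "Reply with only the missing word.\n\n" + mask_word(passage, target)
    )
    guess = query_model(prompt).strip().strip('.,"\'').lower()
    return guess == target.lower()


if __name__ == "__main__":
    excerpt = "Jack and I sat perfectly still with the radar humming."
    # Stand-in model for demonstration purposes only.
    fake_model = lambda prompt: "radar"
    print(probe_memorization(excerpt, "radar", fake_model))  # True
```

In practice, a single correct guess proves little; the study’s signal comes from aggregating such predictions over many excerpts, where consistently recovering unlikely words suggests the model saw the text during training.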

Implications for AI Transparency and Accountability

Abhilasha Ravichander, a doctoral student at the University of Washington and co-author of the study, emphasized the need for more transparency in AI training practices. She pointed out that for AI models to be trustworthy, they must be fully auditable and verifiable, especially regarding the data they are trained on.

While OpenAI has advocated for looser restrictions on the use of copyrighted data in model development, it has also implemented content licensing agreements and opt-out mechanisms for copyright holders. Despite these measures, OpenAI has lobbied governments worldwide to codify fair use rules specific to AI training.

What The Author Thinks

In my opinion, the findings of this study highlight a critical issue for AI development—transparency. While OpenAI defends its approach to training models, the possibility of unintentionally memorizing copyrighted content raises serious questions about intellectual property rights in the AI space. For AI to continue advancing, clear rules must be established to ensure that models are not just innovative but also ethically trained and transparent.
