New Study Suggests OpenAI Models ‘Memorized’ Copyrighted Content

A new study suggests that OpenAI’s models may have “memorized” portions of copyrighted content, lending weight to ongoing allegations from authors, programmers, and other rights holders. These groups accuse OpenAI of using their works, including books, codebases, and articles, to train its AI models without permission.

OpenAI has long relied on a fair use defense, arguing that training on publicly available data is covered by that doctrine. Critics counter that U.S. copyright law contains no exception for training data, and the disagreement has led to legal battles over the practice.

Method of Identifying “Memorized” Content

The study, co-authored by researchers from the University of Washington, the University of Copenhagen, and Stanford University, proposes a new method for identifying training data “memorized” by AI models such as OpenAI’s. The method focuses on “high-surprisal” words, meaning words that are statistically unlikely in their context. For example, in the sentence “Jack and I sat perfectly still with the radar humming,” the word “radar” would be considered high-surprisal because it is less likely than words such as “engine” or “radio” to appear before “humming.”
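To make the idea of surprisal concrete, here is a minimal sketch using the standard information-theoretic definition: the surprisal of a word is the negative log of its probability in context. The probabilities below are invented for illustration and do not come from the study, and the article does not say how the researchers estimated them.

```python
import math


def surprisal(prob: float) -> float:
    """Return the surprisal, in bits, of an event with probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


# Hypothetical probabilities for the word that fills the blank in
# "Jack and I sat perfectly still with the ___ humming".
# These numbers are made up for the example, not measured.
candidate_probs = {"engine": 0.30, "radio": 0.25, "radar": 0.02}

surprisals = {word: surprisal(p) for word, p in candidate_probs.items()}

# The least likely candidate carries the highest surprisal.
most_surprising = max(surprisals, key=surprisals.get)

for word, bits in sorted(surprisals.items(), key=lambda item: item[1]):
    print(f"{word}: {bits:.2f} bits")
```

Under these assumed numbers, “radar” has the highest surprisal of the three candidates, which is what makes it a useful word to mask.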

The researchers removed high-surprisal words from excerpts of texts such as fiction books and New York Times articles, then asked the models to predict the masked words. The reasoning is that a model that correctly fills in such an unlikely word has probably seen the passage during training. By this test, OpenAI’s GPT-4 showed signs of having memorized passages from popular books and articles. Specifically, GPT-4 appeared to have memorized parts of books in a dataset called BookMIA, which contains copyrighted e-books. It also appeared to have memorized portions of New York Times articles, though at a lower rate.
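The masking-and-guessing probe described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the researchers’ code: `mask_words` and `exact_match_rate` are hypothetical helper names, the `[MASK]` token is an arbitrary choice, and the step of actually querying a model is left out.

```python
import re


def mask_words(text: str, targets: list[str], mask: str = "[MASK]") -> tuple[str, list[str]]:
    """Replace whole-word occurrences of `targets` with `mask`.

    Returns the masked text and the removed words in order of appearance.
    """
    if not targets:
        return text, []
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in targets) + r")\b")
    removed = pattern.findall(text)
    return pattern.sub(mask, text), removed


def exact_match_rate(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that equal the removed word (case-insensitive)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(p.strip().lower() == a.lower() for p, a in zip(predictions, answers))
    return hits / len(answers)


sentence = "Jack and I sat perfectly still with the radar humming"
masked, removed = mask_words(sentence, ["radar"])
print(masked)

# In a real probe, the guesses would come from the model under test.
# Here they are hard-coded placeholders.
model_guesses = ["radar"]
print(exact_match_rate(model_guesses, removed))
```

A high match rate on words that are hard to guess from context alone would be read as a sign of memorization, while a low rate would be inconclusive.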

Implications for AI Transparency and Accountability

Abhilasha Ravichander, a doctoral student at the University of Washington and co-author of the study, emphasized the need for more transparency in AI training practices. She pointed out that for AI models to be trustworthy, they must be fully auditable and verifiable, especially regarding the data they are trained on.

OpenAI has implemented content licensing agreements and opt-out mechanisms for copyright holders. At the same time, it has advocated for looser restrictions on the use of copyrighted data in model development and has lobbied governments worldwide to codify fair use rules specific to AI training.

What The Author Thinks

In my opinion, the findings of this study highlight a critical issue for AI development: transparency. While OpenAI defends its approach to training models, the possibility that its models have memorized copyrighted content, even unintentionally, raises serious questions about intellectual property rights in the AI space. For AI to continue advancing, clear rules must be established to ensure that models are not just innovative but also ethically trained and transparent.
