Microsoft Ventures into AI Training Data Attribution with New Research Project

Microsoft has launched a new research project built around what it calls "training-time provenance," an effort to tackle the complex problem of AI training data attribution. The project centers on understanding how particular training examples shape the generative outputs of large AI models, such as text and images. Jaron Lanier, the veteran technologist at Microsoft Research, is leading the work, which is framed as a way to address the multifaceted ethics and copyright questions raised by generative AI.

The effort could hardly be timelier, given that Microsoft is already facing copyright claims on multiple fronts. In December, The New York Times sued Microsoft and OpenAI, alleging that their generative AI models were trained on copyrighted material taken from the publication's 130-year archive of articles. Separately, at least five software developers have sued Microsoft, claiming that its coding assistant GitHub Copilot was trained unlawfully on their copyrighted code. These legal battles underscore the urgency for Microsoft to address the ethical implications of using copyrighted material in AI training.

“A data-dignity approach would trace the most unique and influential contributors when a big model provides a valuable output,” said Jaron Lanier, highlighting one of the project’s core objectives.
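Microsoft has not disclosed how training-time provenance would work under the hood, but the general idea of tracing an output back to influential training examples can be sketched with a gradient-similarity heuristic in the spirit of published attribution methods such as TracIn. The snippet below is purely illustrative and is not Microsoft's method; `per_example_grad` and `influence_scores` are hypothetical helpers, and the sketch assumes a differentiable PyTorch model, a loss function, and a small in-memory list of (input, label) training pairs.

```python
# Illustrative sketch of gradient-based training-data attribution
# (a rough, single-checkpoint, TracIn-style heuristic). Not Microsoft's method.
import torch


def per_example_grad(model, loss_fn, x, y):
    """Return the flattened gradient of the loss on a single (x, y) pair."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def influence_scores(model, loss_fn, train_pairs, query_x, query_y, lr=1e-3):
    """Score each training example by how strongly its loss gradient aligns
    with the gradient of the loss on the query output. Larger scores suggest
    the example pushed the model toward producing that output."""
    q = per_example_grad(model, loss_fn, query_x, query_y)
    return [
        lr * torch.dot(q, per_example_grad(model, loss_fn, x, y)).item()
        for x, y in train_pairs
    ]

# Example (toy) usage:
# scores = influence_scores(model, loss_fn, train_pairs, query_x, query_y)
# top_contributors = sorted(range(len(scores)), key=lambda i: -scores[i])[:5]
```

In practice, attribution research approximates this idea at scale with checkpoint ensembling, gradient compression, or influence-function approximations, since computing per-example gradients over a web-scale training corpus is prohibitively expensive.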

Ethical Concerns and the Push for Contributor Recognition

Some observers will interpret Microsoft's effort as an attempt to "ethics wash" the controversy over training AI models on copyrighted materials, especially artistic works. Even so, the tech giant's push to develop a contributor recognition system would, if it works, be a meaningful step toward addressing the ethical and legal challenges the industry currently faces.

“For instance, if you ask a model for ‘an animated movie of my kids in an oil-painting world of talking cats on an adventure,’ then certain key oil painters, cat portraitists, voice actors, and writers — or their estates — might be calculated to have been uniquely essential to the creation of the new masterpiece. They would be acknowledged and motivated. They might even get paid,” Lanier explained, illustrating the potential impact of the project.

For all its ambition, the "training-time provenance" project may prove to be little more than a proof of concept. Still, it is in step with a broader industry push to pair innovation with ethical responsibility. Bria, an AI model developer that claims to "programmatically" compensate data owners based on their influence, recently secured $40 million in venture capital funding. Firms like Adobe and Shutterstock have built infrastructure to compensate dataset creators, though the size of those payouts isn't yet clear.

OpenAI, for its part, has said it is developing tools that would give creators greater control over how their works are used in, or excluded from, training datasets. The company has also lobbied the U.S. government to recognize blanket fair use for training models on copyrighted material, arguing that such guidelines would provide much-needed legal clarity.

“Current neural network architectures are opaque in terms of providing sources for their generations, and there are […] good reasons to change this,” stated Microsoft in a job listing, signaling its commitment to enhancing transparency in AI training processes.

What The Author Thinks

Microsoft’s focus on transparency and attribution in AI training data is a necessary step in the right direction. If companies like Microsoft are serious about building responsible AI, they must prioritize ethical frameworks and legal clarity. It is essential to ensure that creators are properly credited and compensated for their contributions to training data, and this project is a positive move toward achieving that goal.
