New ‘Open Source AI Definition’ Criticized for Not Opening Training Data


Long-time Slashdot reader samj (also a long-time Debian developer) tells us there’s some opposition to the newly released Open Source AI definition. He calls it a “fork” that undermines the original Open Source definition (which was derived from Debian’s Free Software Guidelines, written primarily by Bruce Perens), and points us to a new domain with a petition declaring that Open Source shall instead be defined “solely by the Open Source Definition version 1.9. Any amendments or new definitions shall only be recognized with clear community consensus via an open and transparent process.”

This move follows some discussion on the Debian mailing list:


Allowing “Open Source AI” to hide their training data is nothing but setting up a “data barrier” protecting the monopoly, disabling anybody other than the first party to reproduce or replicate an AI. Once passed, OSI is making a historical mistake towards the FOSS ecosystem.

They’re not the only ones worried about data. This week TechCrunch noted an August study which “found that many ‘open source’ models are basically open source in name only. The data required to train the models is kept secret, the compute power needed to run them is beyond the reach of many developers, and the techniques to fine-tune them are intimidatingly complex. Instead of democratizing AI, these ‘open source’ projects tend to entrench and expand centralized power, the study’s authors concluded.”

samj shares the concern about training data, arguing that training data is the source code and that this new definition has real-world consequences. (On a personal note, he says it “poses an existential threat to our pAI-OS project at the non-profit Kwaai Open Source Lab I volunteer at, so we’ve been very active in pushing back past few weeks.”)

He also generated a detailed response by asking ChatGPT what the implications would be of Debian disavowing the OSI’s Open Source AI definition. ChatGPT composed a 7-point, 14-paragraph response, concluding that this level of opposition would “create challenges for AI developers regarding licensing. It might also lead to a fragmentation of the open-source community into factions with differing views on how AI should be governed under open-source rules.” But “Ultimately, it could spur the creation of alternative definitions or movements aimed at maintaining stricter adherence to the traditional tenets of software freedom in the AI age.”

However, the official FAQ for the new Open Source AI definition argues that training data “does not equate to a software source code.”

Training data is important to study modern machine learning systems. But it is not what AI researchers and practitioners necessarily use as part of the preferred form for making modifications to a trained model…. [F]orks could include removing non-public or non-open data from the training dataset, in order to train a new Open Source AI system on fully public or open data…

[W]e want Open Source AI to exist also in fields where data cannot be legally shared, for example medical AI. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information — like decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.

Read on for the rest of their response…


“There are also many cases where terms of use of publicly-available data may give entity A the confidence that they may use it freely and call it ‘open data’, but not give entity A the confidence they can give entity B guarantees in a different jurisdiction. Meanwhile, entity B may or may not feel confident to use that data in their jurisdiction. An example is so-called public domain data, where the definition of public domain varies country-by-country. Another example is fair-use or private data where the finding of fair use or privacy laws may require a good knowledge of the law of a given jurisdiction. This resharing is not so much limited as lacking legal certainty.

“Some people believe that full unfettered access to all training data (with no distinction of its kind) is paramount, arguing that anything less would compromise full reproducibility of AI systems, transparency and security. This approach would relegate Open Source AI to a niche of AI trainable only on open data… That niche would be tiny, even relative to the niche occupied by Open Source in the traditional software ecosystem. The requirements of Data Information keep the same approach present in the Open Source Definition that doesn’t mandate full reproducibility and transparency but enables them (i.e. reproducible builds). At the same time, setting a baseline requiring Data Information doesn’t preclude others from formulating and demanding more requirements, like the Digital Public Goods Standard or the Free Systems Distribution Guidelines add requirements to the Open Source Definition.

“One of the key aspects of OSI’s mission is to drive and promote Open Source innovation. The approach OSI takes here enables full user choice with Open Source AI. Users can keep the insights derived from training+data pre-processing code and description of unshareable training data and build upon those with their own unshareable data and give the insights derived from further training to everyone, allowing for Open Source AI in areas like healthcare. Or users can obtain the available and public data from the Data Information and retrain their model without any unshareable data resulting in more data transparency in the resulting AI system. Just like with copyleft and permissive licensing, this approach leaves the choice with the user…

“This approach both advances openness in all the components of the AI system and drives more Open Source AI, i.e. in private-first areas such as healthcare.”


