Open-source AI definition finally gets its first release candidate – and a compromise


Getting open source and artificial intelligence (AI) on the same page isn’t easy. Just ask the Open Source Initiative (OSI). The OSI, the steward of the Open Source Definition, has been working on an open-source AI definition for two years now. The group is making progress, though: it has now published the first release candidate, RC1, of its Open Source AI Definition.

The latest definition aims to clarify the often contentious discussions surrounding open-source AI. It specifies four fundamental freedoms that an AI system must grant to be considered open source: the ability to use the system for any purpose without permission, to study how it works, to modify it for any purpose, and to share it with or without modifications.

So far, so good. 

However, the OSI has opted for a compromise regarding training data. Recognizing it’s not easy to share full datasets, the current definition requires “sufficiently detailed information about the data used to train the system” rather than the full dataset itself. This approach aims to balance transparency with practical and legal considerations.

That compromise is proving difficult for some people to swallow. From their perspective, if all the data isn’t open, then AI large language models (LLMs) trained on such data can’t be open source.

The OSI summarized these arguments as follows: “Some people believe that full, unfettered access to all training data (with no distinction of its kind) is paramount, arguing that anything less would compromise full reproducibility of AI systems, transparency, and security. This approach would relegate Open-Source AI to a niche of AI trainable only on open data.”

They’re not wrong.  

Yes, the OSI agrees that, ideally, all the training data should be shared and disclosed. However, there are four different data types: open, public, obtainable, and unshareable data. “The legal requirements are different for each. All are required to be shared in the form that the law allows them to be shared.”

In short, “Data can be hard to share. Laws permitting training on data often limit the resharing of that data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information — like decisions about their health.”

The release candidate also addresses other key components of AI systems. It mandates that the complete source code used for training and running the system be available under OSI-approved licenses. Similarly, model parameters and weights must be shared under open terms.

Stefano Maffulli, the OSI’s executive director, emphasized the importance of this definition in combating “open washing” — the practice of companies claiming openness without meeting true open-source standards. “If a company says it’s open source, it must carry the values that the open-source definition carries. Otherwise, it’s just confusing.”

In an Open Source Summit Europe interview in Vienna, Austria, Maffulli told me it’s not just open-source purists who are unhappy with the proposed OSI AI Definition. The others “are corporations, who regard their training schemes and the way they run the training and assemble and filter data sets and create data sets as trade secrets. They don’t want to release those. They think we’re asking too much. It’s an old argument that we heard in the 90s when Microsoft did not want to release their source code or to build instructions.”

In addition, RC1 has two new features. The first is that open-source AI code must be enough for downstream recipients to understand how the machine learning training was done. Training is where innovation is happening and, according to the OSI, that’s “why you don’t see corporations releasing their training and data processing code.” Given the current state of knowledge and practice, access to that code is required to meaningfully fork AI systems.

Lastly, new text acknowledges that creators can explicitly require copyleft terms for open-source AI code, data, and parameters, either individually or as bundled combinations. An example of this would be if a “consortium owning rights to training code and a dataset decided to distribute the bundle code and data with legal terms that tie the two together, with copyleft-like provisions.”

Mind you, the OSI continued, “This sort of legal document doesn’t exist yet, but the scenario is plausible enough that it deserves consideration.”

Don’t think the definition is done and dusted yet. It’s not. True, the OSI doesn’t plan to add new features. From here on out, it and its partners will work on bug fixes. The OSI admits that there may still be “major flaws that may require significant rewrites to the text.” However, the main focus will be on the accompanying documentation.

In addition, the OSI has “realized that in our zeal to solve the problem of data that needs to be provided but cannot be supplied by the model owner for good reasons, we had failed to make clear the basic requirement that ‘if you can share the data you must.’”

If all goes smoothly, the OSI plans to release the final 1.0 version of the Open Source AI Definition at the All Things Open conference on October 28, 2024. Hang tight, folks. We’re getting there. 




