New collaborative data license agreement created to make sharing data sets easier

Blog Post

Linux Foundation AI releases Community Data License Agreement v2, 364 words of data-sharing clarity

By Todd Moore, Ruchir Puri
Published June 22, 2021

As more organizations embrace artificial intelligence (AI) technologies and big data, there is a growing need to share data sets and collaborate on them, both for analysis and for AI training. But just as you should not release valuable software to the public without choosing an appropriate open source software license, you should not release data sets without a proper license written specifically for sharing data.

Today, the Linux Foundation AI announced the release of the CDLA-Permissive-2.0 license agreement, developed to make it easier than ever for governments, academic institutions, businesses, and other organizations to share, access, and protect open source data sets.

What does this release offer?

Like version 1.0, the version 2.0 agreement maintains the clear rights to use, share and modify the data, as well as to use without restriction any results that are generated through data analysis.

Enhancements include:

Plain language to express the grant of permissions and requirements
One page of information to easily read and understand
Simplified experience for adopting the license
Support from IBM and other industry leaders

To recap, it is beautifully short (just 364 words long!) and tuned to data sets for AI use cases. A new license for a new era of data and AI.

IBM is using the license in our data sets

At IBM, we are excited about how this new license will enable better sharing of data sets that can be used in AI and machine learning work. We believe that the ease of using the new CDLA v2 license will benefit the AI and data science community. Today, we are announcing that one of the first IBM data sets to carry the CDLA-Permissive-2.0 license is the Project CodeNet data set.

Project CodeNet is a large data set that is made available via IBM’s Data Asset eXchange (DAX). Its purpose is to train AI models to understand and write code. The data set consists of some 14M code samples, about 500M lines of code, in 55+ different programming languages. We believe Project CodeNet can serve as a benchmark data set for source-to-source translation, and we hope it will do for AI and code what the ImageNet data set did years ago for computer vision.

Although open source software has a number of widely accepted licenses that have helped it thrive, these same licenses can’t be applied to the way data is shared. Similarly, licenses that govern sharing data for creative content don’t usually account for AI and machine learning use cases.

The laws and regulations that govern data sharing have different requirements. The types of data, the location where it is stored or accessed, and the way it’s consumed in AI or machine learning models all have different governance standards. Commonly used licenses for software and creative content might not apply in the intended ways for open data.

The CDLA permissive license was created to address concerns related to AI and ML models generated from open data. Because of our leadership in AI and experience writing open source licenses, IBM was involved in creating version 1 of the CDLA license.

What’s the difference between v1 and v2?

In 2017, drawing on our experience with AI and open source, IBM approached the Linux Foundation with early thinking around licenses for data sets. After collaboration with the Linux Foundation and others, CDLA v1 was released. Feedback about the first version of the license suggested that it was overly complex for non-lawyers to use. To address these concerns, in 2019, Microsoft launched the Open Use of Data Agreement (O-UDA-1.0) to provide a more concise and simplified set of terms around the sharing and use of data for similar purposes. Microsoft then contributed this license to the Linux Foundation in order to bring the two licenses together.

CDLA-Permissive v2 was developed to take the best from the original CDLA-Permissive v1 release and bring in the simplicity of the O-UDA-1.0 to offer a streamlined, simpler license that most readers can use and understand.

Use the new license with your data sets

If you’re involved with AI or machine learning, be sure to check out the license and use it with your own data sets to speed collaboration and innovation in AI.
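One lightweight way to make the license choice visible to people and tools is to record the license identifier next to the data. The following is a minimal sketch in Python, not an official procedure: the `manifest.json` file name, its field names, and the helper functions are hypothetical, chosen only for illustration. The only value taken from this post is the identifier `CDLA-Permissive-2.0`.

```python
import json
from pathlib import Path

# License identifier as named in this post.
LICENSE_ID = "CDLA-Permissive-2.0"


def write_manifest(dataset_dir: Path, name: str) -> Path:
    """Write a small manifest.json that records the data set's license.

    The file name and field names are hypothetical, not a standard.
    """
    dataset_dir.mkdir(parents=True, exist_ok=True)
    manifest = {"name": name, "license": LICENSE_ID}
    path = dataset_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    return path


def read_license(dataset_dir: Path) -> str:
    """Return the license identifier recorded in the manifest."""
    text = (dataset_dir / "manifest.json").read_text(encoding="utf-8")
    return json.loads(text)["license"]
```

Calling `write_manifest(Path("my_dataset"), "my_dataset")` would create the manifest, and `read_license(Path("my_dataset"))` would read the identifier back. A manifest entry like this only labels the data; read the license itself to see what it asks of you when you share the data.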

Open Source @ IBM Blog
