Date: 2025-03-12

Addressing Critical Blockers to Effective Data Set Sharing and Use

As the National AI R&D Strategic Plan currently states, data sharing is essential for AI. Sharing datasets is a necessity, yet it is still severely hindered by opaque terms and conditions, data use policies, and regulations. We need ways to share data safely and unlock its value, for corporations and research institutions alike. Innovation is needed in methodologies and license terms and conditions to enable this sharing in a clear and frictionless manner across different contexts, along with standards for metadata specifications that describe these datasets, including their provenance.

The creation of simpler and more standardized data sharing and licensing models will create greater certainty for large and small entities and reward developers who carefully play by the rules. For example, the Linux Foundation has made available model Community Data License Agreements to provide a licensing framework to support collaborative communities built around curating and sharing “open” data. Efforts to create datasets for which permitted use in the field of AI R&D is both clear and broad, including with respect to the requirements of privacy law, will pay great dividends in reducing transaction costs currently associated with legal uncertainty. The solutions lie in better licensing terms and methodologies for safe creation and responsible use of shared data that spurs scientific progress.

Recommendations:

Institutional and policy innovation is required to ensure that a) data-gathering practices are designed with integrity and availability in mind and that b) new standards are created for data usage terms and conditions and for metadata specifications that describe datasets and their provenance.
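To make the idea of a metadata specification concrete, the following is a minimal sketch of what a machine-readable dataset description with provenance might look like. It is an illustration only, not a proposed standard: the class names, field names, and the example dataset are all hypothetical, and the license identifier is simply the SPDX short name for one of the Community Data License Agreements.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceStep:
    # Hypothetical fields: who did what to the data, and when.
    actor: str
    action: str
    date: str  # ISO 8601 date


@dataclass
class DatasetRecord:
    # Hypothetical record layout, not an existing metadata standard.
    name: str
    license_id: str
    permitted_uses: list
    sha256: str
    provenance: list = field(default_factory=list)


def describe(name, license_id, permitted_uses, payload, provenance):
    """Build a record whose checksum lets recipients verify data integrity."""
    digest = hashlib.sha256(payload).hexdigest()
    return DatasetRecord(name, license_id, list(permitted_uses), digest, list(provenance))


def to_json(record):
    """Serialize the record, including nested provenance steps, to JSON."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


payload = b"id,answer\n1,yes\n"  # stand-in for the real dataset contents
record = describe(
    name="example-survey",
    license_id="CDLA-Permissive-2.0",
    permitted_uses=["research", "commercial"],
    payload=payload,
    provenance=[
        ProvenanceStep("Example Agency", "collected", "2024-01-15"),
        ProvenanceStep("Example Agency", "deidentified", "2024-02-01"),
    ],
)
print(to_json(record))
```

A record of this kind travels with the data, so a recipient can check the usage terms and the collection history before training on it, and can recompute the checksum to confirm that the data has not been altered.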

Several emerging technical areas could be key to innovation in this space. Privacy-preserving federated learning is a promising area that may balance the need to protect privacy against the need for more data sharing for better AI, by enabling collaborative learning without any party giving away its data. Blockchain technology, which is increasingly popular, also has great potential to support scalable and secure data governance.
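The core idea of federated learning can be sketched in a few lines. The example below is a toy version of federated averaging for a one-parameter model y ≈ w * x: each participant trains on its own data, and only the resulting weight is sent to the coordinator. The function names, learning rate, and client data are assumptions made for illustration. Plain federated averaging does not by itself guarantee privacy; deployed systems typically add protections such as secure aggregation or differential privacy.

```python
def local_update(w, data, lr=0.05, epochs=20):
    """Run gradient steps on one client's private (x, y) pairs; data never leaves the client."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(client_weights, client_sizes):
    """Combine client weights, weighting each by its number of examples."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total


def train(clients, rounds=10, w=0.0):
    """Each round: broadcast w, collect locally trained weights, average them."""
    for _ in range(rounds):
        updates = [local_update(w, data) for data in clients]
        w = federated_average(updates, [len(data) for data in clients])
    return w


# Two hypothetical organizations, each holding private samples of y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]
w = train(clients)
print(round(w, 6))
```

The coordinator only ever sees the per-client weights, yet the shared model converges toward the relationship present in the combined data.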

Developing Publicly Shared AI Models and Environments to Share Them

In addition to sharing datasets, the sharing of models should be equally encouraged. Model ‘zoos’ are emerging for different communities, but with no standard format or structure. Additionally, innovation is needed to standardize model quality metrics across dimensions beyond accuracy: the lineage of a model, its license and usage terms (taking into account the terms of the data on which it was trained), and metrics on robustness and fairness. There has been some movement in this direction, such as IBM’s push for AI FactSheets.
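As an illustration of how a model's usage terms could be derived from the terms of its training data, the sketch below records lineage and quality metrics for a model and computes its permitted uses as the intersection of the uses permitted by every training dataset. This is not the AI FactSheets format; the class names, fields, datasets, and metric values are hypothetical placeholders, and treating the intersection as the model's terms is a deliberately conservative simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetTerms:
    # Hypothetical summary of a dataset's license terms.
    name: str
    permitted_uses: frozenset


@dataclass(frozen=True)
class ModelFactSheet:
    # Hypothetical record: lineage plus metrics beyond accuracy.
    name: str
    trained_on: tuple
    accuracy: float
    robustness: float
    fairness_gap: float

    def permitted_uses(self) -> frozenset:
        """A use is allowed only if every training dataset allows it."""
        if not self.trained_on:
            return frozenset()
        uses = self.trained_on[0].permitted_uses
        for dataset in self.trained_on[1:]:
            uses = uses & dataset.permitted_uses
        return uses


open_data = DatasetTerms("open-census-extract", frozenset({"research", "commercial"}))
restricted = DatasetTerms("partner-logs", frozenset({"research"}))

sheet = ModelFactSheet(
    name="demo-classifier",
    trained_on=(open_data, restricted),
    accuracy=0.91,      # placeholder value
    robustness=0.84,    # placeholder value
    fairness_gap=0.03,  # placeholder value
)
print(sorted(sheet.permitted_uses()))
```

With a common record of this kind, a model zoo could let users filter models by permitted use and compare them on robustness and fairness as well as accuracy.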

Advocating for Open Government Data and Models

In alignment with the National AI R&D Strategic Plan goal of making a wide variety of datasets accessible for AI, we advocate for open government data, which is critically important for AI. Currently, only a fraction of existing government datasets is available in full, free, and usable formats. We support government efforts to open more of its datasets in accessible and usable formats, which would have significant positive implications once those datasets are integrated into AI systems.

Government data reflect transparent collection methods, particularly when provided with data provenance, since users obtain the data directly from the source. When these data feed into AI systems, they make those systems more transparent because the source of the data is clearly known.

Open government data also open up a vast resource of high-quality datasets in large quantity. The current reserves of open data are a very limited pool of input, yet we have still made remarkable gains in AI technology with them. With more data, we can have more AI applications and more accurate AI systems. Furthermore, the quality of AI solutions relies on the quality of the input data. Thanks to agency collection standards developed decades ago, government data offer long and consistent records of information, and that longevity and consistency yield robust datasets that can contribute to robust AI systems.

Perhaps even more importantly, open government data are a valuable tool in the battle against bias in AI. They provide a vast number of diverse datasets from different regions, economic classes, and sectors. More diversity in the data translates to more diverse AI systems and outcomes. Government data also help reduce the digital divide because they represent all parts of the population, whereas most data sources represent only people with digital access. Finally, in many cases a greater number of available datasets allows for a larger sample within which to compare, detect, and eliminate a biased dataset.

Open Platforms and Reproducibility

We are encouraged by the emergence of cloud training and testing environments, as well as by the movement among AI conferences to encourage reproducibility in papers through repeatable experiments and code. The emergence of benchmarks such as DAWNBench is also helpful in advancing the field and leveling the playing field. Open source has proven to be a very strong factor in AI, which is promising for the future. Open APIs have increased innovation that scales aggressively. Therefore, we recommend an increased focus on these efforts, along with continued encouragement and incentives for them.