IBM is committed to developing and deploying AI responsibly, and that commitment extends to the data we use to build and train our AI systems. As “Client Zero,” we wanted to put the Data Provenance Standards to the test in a rigorous environment and understand their real impact. So we implemented key elements within our own Integrated Governance Program (IGP), which governs data and models developed and used by IBM, starting with an evaluation of the standards’ comprehensiveness. To do this, we compared the Data Provenance Standards to our own data intake requirements for data sets used to develop foundation models, and we assessed the degree to which the standards’ metadata taxonomy enabled us to validate data suitability for a variety of use cases.

Next, we asked IBM data scientists and researchers with varying levels of experience to apply the Data Provenance Standards to several common types of data, including IBM proprietary data, third-party data and data that includes HAP (hate speech, abusive language and profanity) material.

Finally, we asked experts from the IBM Office of Privacy and Responsible Technology to check the metadata submissions for completeness and accuracy against the Data Provenance Standards, and to review the submissions with the data scientists and researchers to better understand their pain points and sources of confusion. This qualitative feedback enabled us to pinpoint terms, definitions and guidance that were unclear or ambiguous.