AI systems can only be as trustworthy as the data used to develop them. That’s why using high-quality, trusted data is a critical first step toward building responsible AI. But without transparency on data provenance (details about where data originated, how it was developed and how it can be used from a legal and contractual standpoint), evaluating the trustworthiness of a data set can be challenging, even for seasoned data professionals. The lack of a standard metadata taxonomy for data sets is a common pain point across the data ecosystem.
So when the Data & Trust Alliance (D&TA) undertook the development of the very first cross-industry Data Provenance Standards, IBM® was eager to contribute. Throughout 2024, we led early testing efforts and were among the first organizations to begin aligning our internal data standards with the Data Provenance Standards, where appropriate. Now, three months after we concluded our testing and V1.0 of the Data Provenance Standards was formally announced, we have seen a consistent and quantifiable impact on the overall efficiency of our data diligence and management processes.
IBM is committed to developing and deploying AI responsibly, and that commitment extends to the data that we use to build and train our AI systems. As “Client Zero,” we wanted to put the Data Provenance Standards to the test in a rigorous environment to truly understand their impact. So, we implemented key elements within our own Integrated Governance Program (IGP), which governs data and models developed and used by IBM, starting with an evaluation of the standards’ comprehensiveness. To do this, we compared the Data Provenance Standards to our own data intake requirements for data sets that are used to develop foundation models. We also assessed the degree to which the standards’ metadata taxonomy enabled us to validate data suitability for a variety of use cases.
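A comparison of this kind can be pictured as a simple coverage check between two lists of metadata fields. The sketch below is only an illustration of the idea, not IBM's tooling or the actual taxonomy of the Data Provenance Standards: the function name `taxonomy_coverage` and every field name in it are hypothetical placeholders.

```python
def taxonomy_coverage(intake_requirements, standard_fields):
    """Return (share of intake requirements covered, sorted list of uncovered ones)."""
    required = set(intake_requirements)
    available = set(standard_fields)
    if not required:
        # Nothing is required, so nothing can be missing.
        return 1.0, []
    missing = sorted(required - available)
    covered_share = (len(required) - len(missing)) / len(required)
    return covered_share, missing


# Hypothetical field names, for illustration only.
intake = ["data_origin", "license_terms", "issuer", "consent_basis"]
standard = ["data_origin", "license_terms", "issuer", "data_format"]

share, gaps = taxonomy_coverage(intake, standard)
print(f"covered: {share:.0%}, not covered: {gaps}")
```

Any field reported as not covered would be a candidate for a closer look: either the intake requirement maps to a differently named field in the taxonomy, or it marks a genuine gap.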
Next, we asked IBM data scientists and researchers of various levels of experience to apply the Data Provenance Standards to several common types of data, including IBM proprietary data, third-party data and data that includes HAP (hate speech, abusive language and profanity) material.
Finally, we asked experts from the IBM Office of Privacy and Responsible Technology to examine the completeness and accuracy of the metadata submissions in accordance with the Data Provenance Standards, reviewing the submissions with the data scientists and researchers to better understand their pain points or confusion. This qualitative feedback enabled us to pinpoint terms, definitions and guidance that were unclear or ambiguous.
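The completeness part of such a review lends itself to a basic automated pre-check, leaving accuracy and interpretation to human reviewers. The following is a minimal sketch under stated assumptions: the required-field list and the sample submission are hypothetical, and they do not reproduce the fields defined in the Data Provenance Standards.

```python
# Hypothetical required fields; a real list would come from the adopted standard.
REQUIRED_FIELDS = ("title", "issuer", "origin_geography", "license", "intended_use")


def find_incomplete_fields(submission, required=REQUIRED_FIELDS):
    """List required fields that are absent or blank in a metadata submission."""
    incomplete = []
    for field in required:
        value = submission.get(field)
        # Treat missing keys, None and whitespace-only strings as incomplete.
        if value is None or (isinstance(value, str) and not value.strip()):
            incomplete.append(field)
    return incomplete


# Hypothetical submission with one blank field and two missing fields.
submission = {"title": "Example corpus", "issuer": "  ", "license": "CC-BY-4.0"}
flagged = find_incomplete_fields(submission)
print(flagged)
```

A check like this only confirms that something was entered in each field. Whether the entry is correct, or whether a term was misunderstood, still needs the kind of expert review and follow-up conversation described above.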
The most notable impact we’ve observed since more closely aligning our internal data standards with the Data Provenance Standards is a reduction in the time it takes to process data clearance requests. In the eight-month period during which we tested the Data Provenance Standards and implemented other technology and process enhancements, we observed that the average data clearance processing time decreased by 58% for third-party data and 62% for IBM-proprietary data. This improvement is particularly important given the surge in clearance requests coming through IGP. By August 2024, the number of clearance requests for both third-party and IBM-proprietary data had already surpassed the total number for all of 2023.
This improved efficiency is highly valuable. Our data governance team is able to process more data requests with greater speed, enabling us to scale up our data governance program while maintaining our standards for trust and transparency. Some aspects of the Data Provenance Standards that helped us accelerate our data diligence processes include the following:
This has a ripple effect across our entire enterprise. When data clearance requests are accurate and processed more efficiently, model development is accelerated, empowering our teams to respond faster to client requests. It also means that our cross-enterprise catalog of cleared data is always expanding and improving in quality, allowing more efficient and responsible re-use by our practitioners across the business.
Transparent and consistent metadata allows practitioners to make faster, more informed choices about data selection, which can ultimately lead to more responsible models and systems. That’s true not only for IBM, but also across the entire data ecosystem. Wider adoption of the Data Provenance Standards can deliver meaningful return on investment through both further automation and responsible innovation.
Through our “Client Zero” experience with the Data Provenance Standards, we are fortifying our commitment to trust by raising the bar for transparency about the data that underlies our AI systems. Our experience administering our own Integrated Governance Program (IGP), including aligning our internal data standards more closely with the Data Provenance Standards, is enabling us to bring AI to market with greater speed and trust. It has also prepared us to better support clients in implementing their own data governance frameworks, including alignment with industry standards such as the Data Provenance Standards. After all, if we can make something work for IBM, we can certainly help our clients do the same.