How IBM is gaining operational efficiency through enhanced data provenance transparency

4 November 2024

Authors

Christina Montgomery

Vice President

Chief Privacy & Trust Officer

AI systems can only be as trustworthy as the data that is used to develop them. That’s why using high-quality, trusted data is a critical first step toward building responsible AI. But without transparency on data provenance (details about where data originated, how it was developed and how it can be used from a legal and contractual standpoint), evaluating the trustworthiness of a data set can be challenging, even for seasoned data professionals. The lack of a standard metadata taxonomy for data sets is a common pain point across the data ecosystem.

So when the Data & Trust Alliance (D&TA) undertook the development of the very first cross-industry Data Provenance Standards, IBM® was eager to contribute. Throughout 2024, we led early testing efforts and were among the first organizations to begin aligning our internal data standards with the Data Provenance Standards, where appropriate. Now, three months after we concluded our testing and V1.0 of the Data Provenance Standards was formally announced, we have seen a consistent and quantifiable impact on the overall efficiency of our data diligence and management processes.

IBM as “Client Zero” for Data Provenance Standards implementation

IBM is committed to developing and deploying AI responsibly, and that commitment extends to the data that we use to build and train our AI systems. As “Client Zero,” we wanted to test the Data Provenance Standards in a rigorous environment to truly understand their impact. So we implemented key elements within our own Integrated Governance Program (IGP), which governs data and models developed and used by IBM, starting with an evaluation of the standards’ comprehensiveness. To do this, we compared the Data Provenance Standards to our own data intake requirements for data sets that are used to develop foundation models. We also assessed the degree to which the metadata taxonomy of the Data Provenance Standards enabled us to validate data suitability for a variety of use cases.

Next, we asked IBM data scientists and researchers with varying levels of experience to apply the Data Provenance Standards to several common types of data, including IBM-proprietary data, third-party data and data that includes HAP (hate speech, abusive language and profanity) material.

Finally, we asked experts from the IBM Office of Privacy and Responsible Technology to examine the completeness and accuracy of the metadata submissions in accordance with the Data Provenance Standards, reviewing the submissions with the data scientists and researchers to better understand their pain points or confusion. This qualitative feedback enabled us to pinpoint terms, definitions and guidance that were unclear or ambiguous.

How data provenance transparency translates to greater operational efficiency

The most notable impact we’ve observed since more closely aligning our internal data standards with the Data Provenance Standards is a reduction in the time it takes to process data clearance requests. In the eight-month period during which we tested the Data Provenance Standards and implemented other technology and process enhancements, we observed that the average data clearance processing time decreased by 58% for third-party data and 62% for IBM-proprietary data. This improvement is particularly important given the surge in clearance requests coming through IGP. By August 2024, the number of clearance requests for both third-party and IBM-proprietary data had already surpassed the total number for all of 2023.

This efficiency gain matters: our data governance team can process more data requests in less time, which lets us scale up our data governance program while maintaining our standards for trust and transparency. Some aspects of the Data Provenance Standards that helped us accelerate our data diligence processes include the following:

  • Method: Describes procedures used to collect, generate or compile the data. This element is important because aggregators often do not make these details available, making it more difficult to assess the reliability and validity of the data.
  • Confidentiality classification: Specifies the types of sensitive data known to be present in the data. This classification guides proper data access and handling.
  • Data issuer: Describes where the data originated and whether the provider is the actual owner. Because third parties can republish data as if it were their own, this element enables accountability and opens a line of contact for potential inquiries.
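
To make these three elements concrete, here is a minimal Python sketch of how a data intake check might represent them and flag gaps before a clearance review. The class name, field names and the completeness rule are illustrative assumptions. They are not the schema published in the Data Provenance Standards, nor the implementation used in IBM’s IGP.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    dataset_name: str
    # Method: how the data was collected, generated or compiled.
    method: str
    # Confidentiality classification: types of sensitive data known to be
    # present. None means "not declared"; an empty list means "declared,
    # no sensitive data known".
    confidentiality_classes: Optional[List[str]]
    # Data issuer: where the data originated.
    data_issuer: str
    # Whether the provider is the actual owner (None means "not declared").
    issuer_is_owner: Optional[bool] = None


def missing_elements(record: ProvenanceRecord) -> List[str]:
    """Return the names of fields that are undeclared or blank."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(f.name)
    return missing


# Example: an aggregator supplies a data set without describing how it was
# compiled or whether it owns the data.
record = ProvenanceRecord(
    dataset_name="example-corpus",
    method="",
    confidentiality_classes=[],
    data_issuer="Example Aggregator Inc.",
    issuer_is_owner=None,
)
print(missing_elements(record))  # ['method', 'issuer_is_owner']
```

In this sketch, an undeclared classification (None) is treated differently from a declared empty one, so a reviewer can tell "nobody checked" apart from "checked, nothing sensitive found." A real intake process would likely apply further rules that this example does not attempt to model.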

Faster data clearance has a ripple effect across our entire enterprise. When data clearance requests are accurate and processed more efficiently, model development is accelerated, empowering our teams to respond faster to client requests. It also means that our cross-enterprise catalog of cleared data is always expanding and improving in quality, allowing more efficient and responsible re-use by our practitioners across the business.

Unlocking new business value through data provenance transparency

Transparent and consistent metadata allows practitioners to make faster, more informed choices about data selection, which can ultimately lead to more responsible models and systems. That’s true not only for IBM, but also across the entire data ecosystem. Wider adoption of the Data Provenance Standards can deliver meaningful return on investment through both further automation and responsible innovation.

Through our “Client Zero” experience with the Data Provenance Standards, we are fortifying our commitment to trust by raising the bar for transparency about the data that underlies our AI systems. Our experience administering our own IGP, including aligning our internal data standards more closely with the Data Provenance Standards, is enabling us to bring AI to market with greater speed and trust. It has also prepared us to better support clients in implementing their own data governance frameworks, including alignment with industry standards and frameworks like the Data Provenance Standards. After all, if we can make something work for IBM, we can certainly help our clients do the same.
