How IBM is gaining operational efficiency through enhanced data provenance transparency

Authors

Vice President

Chief Privacy & Trust Officer

AI systems can only be as trustworthy as the data that is used to develop them. That’s why using high-quality, trusted data is a critical first step toward building responsible AI. But without transparency on data provenance (details about where data originated, how it was developed and how it can be used from a legal and contractual standpoint), evaluating the trustworthiness of a data set can be challenging, even for seasoned data professionals. The lack of a standard metadata taxonomy for data sets is a common pain point across the data ecosystem.

So when the Data & Trust Alliance (D&TA) undertook the development of the very first cross-industry Data Provenance Standards, IBM® was eager to contribute. Throughout 2024, we led early testing efforts and were among the first organizations to begin aligning our internal data standards with the Data Provenance Standards, where appropriate. Now, three months after we concluded our testing and V1.0 of the Data Provenance Standards was formally announced, we have seen a consistent and quantifiable impact on the overall efficiency of our data diligence and management processes.

IBM as “Client Zero” for Data Provenance Standards implementation

IBM is committed to developing and deploying AI responsibly, and that commitment extends to the data that we use to build and train our AI systems. As “Client Zero,” we wanted to assess the Data Provenance Standards in a rigorous environment to truly understand their impact and put them to the test in a meaningful way. So, we implemented key elements within our own Integrated Governance Program (IGP), which governs data and models developed and used by IBM, starting with an evaluation of the standards’ comprehensiveness. To do this, we compared the Data Provenance Standards to our own data intake requirements for data sets that are used to develop foundation models, and we assessed the degree to which the metadata taxonomy of the Data Provenance Standards enabled us to validate data suitability for a variety of use cases.

Next, we asked IBM data scientists and researchers of various levels of experience to apply the Data Provenance Standards to several common types of data, including IBM proprietary data, third-party data and data that includes HAP (hate speech, abusive language and profanity) material.

Finally, we asked experts from the IBM Office of Privacy and Responsible Technology to examine the completeness and accuracy of the metadata submissions in accordance with the Data Provenance Standards, reviewing the submissions with the data scientists and researchers to better understand their pain points or confusion. This qualitative feedback enabled us to pinpoint terms, definitions and guidance that were unclear or ambiguous.

How data provenance transparency translates to greater operational efficiency

The most notable impact we’ve observed since more closely aligning our internal data standards with the Data Provenance Standards is a reduction in the time it takes to process data clearance requests. In the eight-month period during which we tested the Data Provenance Standards and implemented other technology and process enhancements, we observed that the average data clearance processing time decreased by 58% for third-party data and 62% for IBM-proprietary data. This improvement is particularly important given the surge in clearance requests coming through IGP. By August 2024, the number of clearance requests for both third-party and IBM-proprietary data had already surpassed the total number for all of 2023.

This improved efficiency is highly valuable. Our data governance team is able to process more data requests with greater speed, enabling us to scale up our data governance program while maintaining our standards for trust and transparency. Some aspects of the Data Provenance Standards that helped us accelerate our data diligence processes include the following:

Method: Describes procedures used to collect, generate or compile the data. This element is important because aggregators often do not make these details available, making it more difficult to assess the reliability and validity of the data.
Confidentiality classification: Specifies the types of sensitive data known to be present in the data. This classification guides proper data access and handling.
Data issuer: Describes where the data originated and whether the provider is the actual owner. Because third parties can republish data as if it were their own, this element enables accountability and opens a line of contact for potential inquiries.
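
To make the three elements above concrete, here is a minimal sketch of how a metadata submission might be represented and checked for blank fields. The class name, field names and sample values are illustrative assumptions; they are not the official schema of the Data Provenance Standards or part of IBM's IGP tooling.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ProvenanceRecord:
    """Hypothetical record for the three elements described above.

    Field names are assumptions for illustration, not the official schema.
    """
    method: str = ""                          # how the data was collected, generated or compiled
    confidentiality_classification: str = ""  # types of sensitive data known to be present
    data_issuer: str = ""                     # where the data originated and who provides it


def missing_elements(record: ProvenanceRecord) -> List[str]:
    """Return the names of elements that were left blank in a submission."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Made-up example submission with one element left blank.
submission = ProvenanceRecord(
    method="Web crawl, deduplicated",
    confidentiality_classification="No personal data known",
    data_issuer="",
)
print(missing_elements(submission))  # ['data_issuer']
```

A check like this only flags blank elements. Judging whether a filled-in value is accurate still needs the kind of expert review described earlier.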

These efficiency gains have a ripple effect across our entire enterprise. When data clearance requests are accurate and processed more efficiently, model development is accelerated, empowering our teams to respond faster to client requests. It also means that our cross-enterprise catalog of cleared data is always expanding and improving in quality, allowing more efficient and responsible re-use by our practitioners across the business.

Unlocking new business value through data provenance transparency

Transparent and consistent metadata allows practitioners to make faster, more informed choices about data selection, which can ultimately lead to more responsible models and systems. That’s true not only for IBM, but also across the entire data ecosystem. Wider adoption of the Data Provenance Standards can deliver meaningful return on investment through both further automation and responsible innovation.

Through our “Client Zero” experience with the Data Provenance Standards, we are fortifying our commitment to trust by raising the bar for transparency about the data that underlies our AI systems. Our experience administering IGP, including aligning our internal data standards more closely with the Data Provenance Standards, is enabling us to bring AI to market with greater speed and trust. It has also prepared us to better support clients in implementing their own data governance frameworks, including alignment with industry standards and frameworks like the Data Provenance Standards. After all, if we can make something work for IBM, we can certainly help our clients do the same.

