Organizations with a firm grasp on how, where, and when to use artificial intelligence (AI) can take advantage of a wide range of AI-based capabilities.
Whether in healthcare, hospitality, finance or manufacturing, the beneficial use cases of AI are virtually limitless. But implementing AI is only one piece of the puzzle.
The continuous application of AI and the ability to benefit from its ongoing use require the persistent management of a dynamic and intricate AI lifecycle—and doing so efficiently and responsibly. Here’s what’s involved in making that happen.
AI models rely on vast amounts of data for training. Whether building a model from the ground up or fine-tuning a foundation model, data scientists must utilize the necessary training data regardless of that data’s location across a hybrid infrastructure. Once trained and deployed, models also need reliable access to historical and real-time data to generate content, make recommendations, detect errors, send proactive alerts, etc.
As a model grows or expands in the kinds of tasks it can perform, it needs a way to connect to new data sources that are trustworthy, without hindering its performance or compromising systems and processes elsewhere.
While AI models need the flexibility to access data across a hybrid infrastructure, they also need safeguarding from tampering (unintentional or otherwise) and, especially, protected access to data.
AI models aren’t static. They’re built on machine learning algorithms that create outputs based on an organization’s data or other third-party big data sources. Sometimes, these outputs are biased because the data used to train the model was incomplete or inaccurate in some way. Bias can also find its way into a model’s outputs long after deployment. Likewise, a model’s outputs can “drift” away from their intended purpose and become less accurate—all because the data a model uses and the conditions in which a model is used naturally change over time. Models in production, therefore, must be continuously monitored for bias and drift.
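To make the idea of drift monitoring concrete, here is a minimal sketch (not tied to any particular product) that compares a feature's distribution at training time against its distribution in production using the population stability index, a common drift heuristic. The data, bin count and 0.2 threshold are illustrative assumptions.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare two samples of a numeric feature; a higher PSI means more drift."""
    # Bin edges come from the training-time (expected) distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid division by zero and log of zero.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)  # feature as seen at training time
live = rng.normal(0.5, 1.2, 10_000)   # same feature in production, shifted
psi = population_stability_index(train, live)
# A common rule of thumb treats PSI > 0.2 as significant drift.
print(f"PSI = {psi:.3f}:", "drift suspected" if psi > 0.2 else "stable")
```

In a production monitoring loop, a check like this would run on a schedule per feature and per model output, with alerts wired to the governance team.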
An AI model must be fully understood from every angle, inside and out—from what enterprise data is used and when, to how the model arrived at a certain output. Depending on where an organization conducts business, it will need to comply with any number of government regulations regarding where data is stored and how an AI model uses data to perform its tasks. Current regulations are always changing, and new ones are being introduced all the time. So, the greater the visibility and control an organization has over its AI models now, the better prepared it will be for whatever AI and data regulations are coming around the corner.
Among the tasks necessary for internal and external compliance is the ability to report on the metadata of an AI model. Metadata includes details specific to an AI model, such as its data origins, training methods and behaviors.
With metadata management and the ability to generate reports with ease, data stewards are better equipped to demonstrate compliance with a variety of existing data privacy regulations, such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA) or the Health Insurance Portability and Accountability Act (HIPAA).
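As a rough illustration of what reportable model metadata might look like in practice, the sketch below defines a hypothetical metadata record and serializes it for auditors. All field names and values are invented for the example and do not reflect any specific regulation's requirements.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import date

@dataclass
class ModelMetadata:
    """Illustrative metadata record for one AI model (all fields hypothetical)."""
    name: str
    version: str
    data_origins: list[str]          # where the training data came from
    training_method: str             # e.g. fine-tuning vs. training from scratch
    deployed_on: str                 # ISO date of deployment
    applicable_regulations: list[str] = field(default_factory=list)

def compliance_report(models: list[ModelMetadata]) -> str:
    """Serialize metadata so it can be handed to auditors or regulators."""
    return json.dumps([asdict(m) for m in models], indent=2)

claims_model = ModelMetadata(
    name="claims-triage",
    version="1.4.0",
    data_origins=["s3://claims-archive/2021", "postgres://ehr/visits"],
    training_method="fine-tuned foundation model",
    deployed_on=str(date(2024, 3, 1)),
    applicable_regulations=["HIPAA", "GDPR"],
)
print(compliance_report([claims_model]))
```

Because the record is plain structured data, the same source can feed both an internal model catalog and on-demand regulatory reports.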
Unfortunately, typical data storage and data governance tools fall short in the AI arena when it comes to helping an organization perform the tasks that underpin efficient and responsible AI lifecycle management. And that makes sense. After all, AI is inherently more complex than standard IT-driven processes and capabilities. Traditional IT solutions simply aren’t dynamic enough to account for the nuances and demands of using AI.
To maximize the business outcomes that can come from using AI while also controlling costs and reducing inherent AI complexities, organizations need to combine AI-optimized data storage capabilities with a data governance program exclusively made for AI.
AI models rely on secure access to trustworthy data, but organizations seeking to deploy and scale these models face an increasingly large and complicated data landscape. Stored data is predicted to grow by 250% by 2025,¹ a surge likely to bring a greater number of disconnected silos and higher associated costs.
To optimize data analytics and AI workloads, organizations need a data store built on an open data lakehouse architecture. This type of architecture combines the performance and usability of a data warehouse with the flexibility and scalability of a data lake. IBM watsonx.data is one example of an open data lakehouse.
Building and integrating AI models into an organization’s daily workflows require transparency into how those models work and how they were created, control over the tools used to develop models, the cataloging and monitoring of those models, and the ability to report on model behavior.
Much as a data governance framework can provide an organization with the means to ensure data availability and proper data management, allow self-service access and better protect its network, AI governance processes enable the monitoring and management of AI workflows throughout the entire AI lifecycle. Solutions such as IBM watsonx.governance are designed specifically for this purpose.
With AI governance practices in place, an organization can provide its governance team with an in-depth and centralized view over all AI models that are in development or production. Checkpoints can be created throughout the AI lifecycle to prevent or mitigate bias and drift. Documentation can also be generated and maintained with information such as a model’s data origins, training methods and behaviors. This allows for a high degree of transparency and auditability.
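One way to picture the lifecycle checkpoints described above is as a set of gates a model must pass before promotion to the next stage. The sketch below is purely illustrative; the checks, metric names and thresholds are assumptions, not part of any governance product.

```python
from typing import Callable

# A check inspects a model's recorded metadata/metrics and passes or fails.
Check = Callable[[dict], bool]

# Hypothetical gates for two lifecycle transitions; thresholds are invented.
CHECKPOINTS: dict[str, list[tuple[str, Check]]] = {
    "to_staging": [
        ("documented data origins", lambda m: bool(m.get("data_origins"))),
        ("bias metric within bound", lambda m: m.get("bias_score", 1.0) < 0.1),
    ],
    "to_production": [
        ("drift metric within bound", lambda m: m.get("psi", 1.0) < 0.2),
    ],
}

def promote(model: dict, gate: str) -> list[str]:
    """Return the names of failed checks; an empty list means promotion is allowed."""
    return [name for name, check in CHECKPOINTS[gate] if not check(model)]

candidate = {"data_origins": ["s3://claims-archive"], "bias_score": 0.04, "psi": 0.31}
print("to_staging failures:", promote(candidate, "to_staging"))        # []
print("to_production failures:", promote(candidate, "to_production"))  # ['drift metric within bound']
```

Because each gate records which check failed, the governance team gets both a blocking mechanism and a paper trail in one place.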
AI-optimized data stores that are built on open data lakehouse architectures can ensure fast access to trusted data across hybrid environments. Combined with powerful AI governance capabilities that provide visibility into AI processes, models, workflows, data sources and actions taken, they deliver a strong foundation for practicing responsible AI.
Responsible AI is the mission-critical practice of designing, developing and deploying AI in a manner that is fair to all stakeholders—from workers across various business units to everyday consumers—and compliant with all applicable policies.
By combining AI-optimized data stores with AI governance and scaling AI responsibly, an organization can achieve the numerous benefits of responsible AI, including:
1. Minimized unintended bias—An organization will know exactly what data its AI models are using and where that data is located. Meanwhile, data scientists can quickly disconnect or connect data assets as needed via self-service data access. They can also spot and root out bias and drift proactively by monitoring, cataloging and governing their models.
2. Security and privacy—When all data scientists and AI models are given access to data through a single point of entry, data integrity and security are improved. A single point of entry eliminates the need to duplicate sensitive data for various purposes or move critical data to a less secure (and possibly non-compliant) environment.
3. Explainable AI—Explainable AI is achieved when an organization can confidently and clearly state what data an AI model used to perform its tasks. Key to explainable AI is the ability to automatically compile information on a model to better explain its analytics decision-making. Doing so allows easy demonstration of compliance and reduces exposure to possible audits, fines and reputational damage.
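The "single point of entry" and explainability points above can be sketched together: one hypothetical gateway function mediates every data read, checks an access policy, and records the attempt, so the resulting log doubles as audit evidence. All names and policy rules here are invented for illustration.

```python
from datetime import datetime, timezone

# Hypothetical policy: which roles may read which data assets.
POLICY = {
    "claims-archive": {"data-scientist", "auditor"},
    "ehr-visits": {"data-scientist"},
}

AUDIT_LOG: list[dict] = []  # in practice this would be durable, append-only storage

def read_asset(asset: str, role: str) -> str:
    """Single point of entry: every read is policy-checked and logged."""
    allowed = role in POLICY.get(asset, set())
    AUDIT_LOG.append({
        "asset": asset,
        "role": role,
        "allowed": allowed,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    if not allowed:
        raise PermissionError(f"{role} may not read {asset}")
    return f"contents of {asset}"  # stand-in for the real data

read_asset("claims-archive", "data-scientist")   # permitted, logged
try:
    read_asset("ehr-visits", "auditor")          # denied, but still logged
except PermissionError as e:
    print("blocked:", e)

# The log doubles as audit evidence: exactly what data each caller touched.
print(len(AUDIT_LOG), "access events recorded")
```

Routing every read through one function means sensitive data never needs to be copied out to a less governed environment, and the access log gives an explainability report its raw material.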
1. Worldwide IDC Global DataSphere Forecast, 2022–2026: Enterprise Organizations Driving Most of the Data Growth, May 2022
Learn how the EU AI Act will impact business, how to prepare, how you can mitigate risk and how to balance regulation and innovation.
Learn about the new challenges of generative AI, the need for governing AI and ML models and steps to build a trusted, transparent and explainable AI framework.
Read about driving ethical and compliant practices with a platform for generative AI models.
Gain a deeper understanding of how to ensure fairness, manage drift, maintain quality and enhance explainability with watsonx.governance™.
We surveyed 2,000 organizations about their AI initiatives to discover what's working, what's not and how you can get ahead.
Learn how to select the most suitable AI foundation model for your use case.