Generative AI has altered the tech industry by introducing new data risks, such as sensitive data leakage through large language models (LLMs), and by driving an increase in requirements from regulatory bodies and governments. To navigate this environment successfully, organizations must revisit the core principles of data management and ensure they are taking a sound approach to augmenting large language models with enterprise and other non-public data.
A good place to start is refreshing the way the organization governs data, particularly as it pertains to its use in generative AI solutions. For example:
- Validating and creating data protection capabilities: Data platforms must be prepared for higher levels of protection and monitoring. This requires traditional capabilities such as encryption, anonymization and tokenization, but also new capabilities to automatically classify data (by sensitivity and taxonomy alignment) using machine learning. Data discovery and cataloging tools can assist, but they should be augmented so that classification reflects the organization’s own understanding of its data. This allows organizations to apply new policies effectively and bridge the gap between the conceptual understanding of data and the reality of how data solutions have been implemented (a minimal classification sketch follows this list).
- Improving controls, auditability and oversight: Data access, usage and third-party engagement with enterprise data require new designs built on existing solutions, which capture only a portion of the requirements needed to ensure authorized usage of the data. Firms also need complete audit trails and monitoring systems to track how data is used, when it is modified, and whether it is shared through third-party interactions, for both gen AI and non-gen AI solutions. It is no longer sufficient to control data by restricting access to it; organizations should also track the use cases for which data is accessed and applied within analytical and operational solutions. Automated alerts and reporting of improper access and usage (measured by query analysis, data exfiltration and network movement) should be developed by infrastructure and data governance teams and reviewed regularly to proactively ensure compliance (see the alerting sketch after this list).
- Preparing data for gen AI: Gen AI departs from traditional data management patterns and skills, requiring new discipline to ensure the quality, accuracy and relevance of data used to train and augment language models. With vector databases becoming commonplace in the gen AI domain, data governance must be extended to cover these non-traditional data management platforms so that the same practices are applied to the new architectural components. Data lineage becomes even more important as regulatory bodies increasingly require explainability in models.
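As a rough illustration of the ML-assisted classification described in the first bullet, the sketch below trains a small text classifier over field names and descriptions and proposes a sensitivity label for newly discovered fields. It assumes scikit-learn; the fields, labels and training examples are purely illustrative and would come from the organization’s own catalog in practice.

```python
# Minimal sketch: ML-assisted sensitivity classification of data fields.
# Assumes scikit-learn; the training examples and labels are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labeled examples an organization might collect from its own catalog:
# a field's name/description paired with the sensitivity class it was assigned.
training_examples = [
    ("customer_ssn social security number", "restricted"),
    ("cardholder primary account number", "restricted"),
    ("patient diagnosis code icd10", "confidential"),
    ("employee home address", "confidential"),
    ("product sku and list price", "internal"),
    ("public press release body text", "public"),
]
texts, labels = zip(*training_examples)

# TF-IDF over field names/descriptions feeding a linear classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(texts, labels)

# Classify newly discovered fields so policies can be applied automatically.
new_fields = ["beneficiary date_of_birth", "quarterly revenue by region"]
for field, label in zip(new_fields, classifier.predict(new_fields)):
    print(f"{field!r} -> proposed sensitivity: {label}")
```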
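The automated alerting described in the second bullet can start from something as simple as scanning a query audit log against an approved-purpose policy. The sketch below is a minimal illustration; the log schema, purposes and thresholds are assumptions, not a prescribed design.

```python
# Minimal sketch: flagging potentially improper data access from a query audit log.
# The log schema, approved purposes and thresholds are illustrative assumptions.
from datetime import datetime

audit_log = [
    {"user": "svc_genai_rag", "table": "customer_pii", "rows": 120000,
     "purpose": "rag_indexing", "ts": datetime(2024, 5, 1, 2, 13)},
    {"user": "analyst_42", "table": "customer_pii", "rows": 250,
     "purpose": "ad_hoc", "ts": datetime(2024, 5, 1, 9, 30)},
]

# Policy: which purposes are approved for each sensitive table.
approved_purposes = {"customer_pii": {"rag_indexing", "regulatory_reporting"}}
ROW_THRESHOLD = 50_000  # unusually large reads may indicate bulk exfiltration

alerts = []
for event in audit_log:
    allowed = approved_purposes.get(event["table"], set())
    if event["purpose"] not in allowed:
        alerts.append(("unapproved_purpose", event))
    if event["rows"] > ROW_THRESHOLD:
        alerts.append(("large_extract", event))

for reason, event in alerts:
    print(f"ALERT [{reason}] user={event['user']} table={event['table']} rows={event['rows']}")
```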
Enterprise data is often complex, diverse and scattered across various repositories, making it difficult to integrate into gen AI solutions. This complexity is compounded by the need to ensure regulatory compliance, mitigate risk, and address skill gaps in data integration and retrieval-augmented generation (RAG) patterns. Moreover, data is often an afterthought in the design and deployment of gen AI solutions, leading to inefficiencies and inconsistencies.
Unlocking the full potential of enterprise data for generative AI
At IBM, we have developed an approach to solving these data challenges: the IBM gen AI data ingestion factory, a managed service designed to address AI’s “data problem” and unlock the full potential of enterprise data for gen AI. Its predefined architecture and code blueprints, deployed as a managed service, simplify and accelerate the process of integrating enterprise data into gen AI solutions. We approach the problem with data management in mind, preparing data for governance, risk and compliance from the outset.
Our core capabilities include:
- Scalable data ingestion: Reusable services to scale data ingestion and RAG across gen AI use cases and solutions, with optimized chunking and embedding patterns (a minimal ingestion sketch follows this list).
- Regulatory compliance: Data is prepared for gen AI usage in a way that anticipates current and future regulations, helping companies meet the compliance requirements of market regulations focused on generative AI.
- Data privacy management: Long-form text can be anonymized as it is discovered, reducing risk and ensuring data privacy.
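To make the ingestion and privacy patterns above concrete, the sketch below chunks a document, masks obvious PII, attaches lineage metadata to each chunk and stores a vector for it. The embed() function and the in-memory list are stand-ins for whatever embedding model and vector database an enterprise actually uses, and the PII patterns are illustrative, not exhaustive.

```python
# Minimal sketch: chunk a document, anonymize obvious PII, attach lineage metadata,
# and store embeddings. embed() and the in-memory "vector store" are stand-ins for
# a real embedding model and vector database; PII regexes are illustrative only.
import hashlib
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def anonymize(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)

def chunk(text: str, size: int = 200, overlap: int = 40):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str):
    # Stand-in embedding: replace with a call to the organization's embedding model.
    return [float(b) / 255 for b in hashlib.sha256(text.encode()).digest()[:8]]

def ingest(doc_id: str, source_uri: str, text: str, store: list):
    for i, piece in enumerate(chunk(anonymize(text))):
        store.append({
            "vector": embed(piece),
            "text": piece,
            # Lineage metadata kept with each chunk supports audits and explainability.
            "lineage": {"doc_id": doc_id, "source_uri": source_uri, "chunk_index": i},
        })

vector_store: list = []
ingest("policy-001", "s3://corp-docs/policies/policy-001.txt",
       "Contact jane.doe@example.com regarding claim 123-45-6789 ...", vector_store)
print(len(vector_store), vector_store[0]["lineage"])
```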
The service is AI and data platform agnostic, allowing for deployment anywhere, and it offers customization to client environments and use cases. By using the IBM® gen AI data ingestion factory, enterprises can achieve several key outcomes, including:
- Reducing time spent on data integration: A managed service that reduces the time and effort required to solve AI’s “data problem”, for example by using a repeatable process for “chunking” and “embedding” data so that each new gen AI use case does not require its own development effort.
- Compliant data usage: Helping to comply with data usage regulations focused on gen AI applications deployed by the enterprise. For example, ensuring data that is sourced in RAG patterns is approved for enterprise usage in gen AI solutions.
- Mitigating risk: Reducing risk associated with data used in gen AI solutions. For example, providing transparent results into what data was sourced to produce an output from a model reduces model risk and time spent proving to regulators how information was sourced.
- Consistent and reproducible results: Delivering consistent and reproducible results from LLMs and gen AI solutions. For example, capturing lineage and comparing outputs (that is, generated data) over time to report on consistency through standard metrics such as ROUGE and BLEU (a minimal sketch follows this list).
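As a simple illustration of the last point, the sketch below compares a current model answer against a stored baseline using ROUGE-L and BLEU. It assumes the open-source rouge-score and nltk packages, and the example answers are invented.

```python
# Minimal sketch: comparing a model's current output against a stored baseline output
# to report consistency over time. Assumes the rouge-score and nltk packages.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

baseline_answer = "Customers may cancel within 30 days for a full refund."
current_answer = "A full refund is available if the customer cancels within 30 days."

# ROUGE-L: longest-common-subsequence overlap between the two generations.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(baseline_answer, current_answer)["rougeL"].fmeasure

# BLEU: n-gram precision of the current answer against the baseline.
bleu = sentence_bleu(
    [baseline_answer.split()],
    current_answer.split(),
    smoothing_function=SmoothingFunction().method1,
)

print(f"ROUGE-L F1: {rouge_l:.3f}  BLEU: {bleu:.3f}")
# Tracking these scores per use case over time gives a simple consistency report.
```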
Navigating the complexities of data risk requires cross-functional expertise. Our team of former regulators, industry leaders and technology experts at IBM Consulting® is uniquely positioned to address this through our consulting services and solutions.
Please see more on the capabilities below and reach out to me at gsbaird@us.ibm.com with any further questions.
Learn more about how AI governance can help fight data risks