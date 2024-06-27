Generative AI has altered the tech industry by introducing new data risks, such as sensitive data leakage through large language models (LLMs), and by driving an increase in requirements from regulatory bodies and governments. To navigate this environment successfully, organizations need to revisit the core principles of data management and ensure that they take a sound approach to augmenting large language models with enterprise or non-public data.

A good place to start is refreshing the way organizations govern data, particularly the data used in generative AI solutions. For example:

Validating and creating data protection capabilities: Data platforms must be prepared for higher levels of protection and monitoring. This requires traditional capabilities like encryption, anonymization and tokenization, as well as new capabilities that use machine learning to classify data automatically (by sensitivity and taxonomy alignment). Data discovery and cataloging tools can assist, but they should be augmented so that the classification reflects the organization’s understanding of its own data. This allows organizations to apply new policies effectively and to bridge the gap between conceptual understandings of data and the reality of how data solutions have been implemented.
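As a rough illustration of how classification and tokenization can work together, the following is a minimal sketch rather than a production design. It swaps the machine-learning classifier described above for simple regular expressions, and the pattern names, the `classify`, `tokenize` and `protect` functions, and the key handling are all hypothetical.

```python
import hashlib
import hmac
import re

# Hypothetical patterns for illustration only. A real deployment would use the
# organization's own taxonomy and most likely a trained classifier.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def classify(text: str) -> set[str]:
    """Return the sensitivity labels whose pattern matches the text."""
    return {
        label
        for label, pattern in SENSITIVE_PATTERNS.items()
        if pattern.search(text)
    }


def tokenize(value: str, key: bytes) -> str:
    """Replace a sensitive value with a deterministic keyed token (HMAC-SHA256)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"


def protect(text: str, key: bytes) -> str:
    """Tokenize every match of every sensitive pattern in the text."""
    for pattern in SENSITIVE_PATTERNS.values():
        text = pattern.sub(lambda m: tokenize(m.group(0), key), text)
    return text
```

The keyed hash used here is one-way, so the same input always maps to the same token but the original value cannot be recovered from it. If reversible tokenization is needed, a separately secured token vault would be one option, which this sketch does not cover.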

Improving controls, auditability and oversight: Data access, usage and third-party engagement with enterprise data require new designs. Existing solutions capture only a portion of the requirements needed to ensure authorized usage of the data. Firms also need complete audit trails and monitoring systems that track how data is used, when it is modified, and whether it is shared through third-party interactions, for both gen AI and non-gen AI solutions. It is no longer sufficient to control data by restricting access to it; organizations should also track the use cases for which data is accessed and applied within analytical and operational solutions. Automated alerts and reporting of improper access and usage (measured by query analysis, data exfiltration and network movement) should be developed by infrastructure and data governance teams and reviewed regularly to proactively ensure compliance.
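To make the idea of tracking use cases (and not only access) more concrete, here is a minimal sketch of an audit log that records the declared purpose of each access and raises alerts. Everything in it is an assumption made for illustration: the `AccessEvent` fields, the `APPROVED_PURPOSES` policy table, the dataset and purpose names, and the row threshold standing in for an exfiltration check.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessEvent:
    user: str
    dataset: str
    purpose: str          # declared use case, e.g. a gen AI or reporting workload
    rows_returned: int
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Hypothetical policy: which use cases each dataset is approved for.
APPROVED_PURPOSES = {
    "customer_profiles": {"support_chatbot_rag", "churn_reporting"},
}
# Hypothetical threshold used as a crude signal of bulk extraction.
MAX_ROWS_PER_QUERY = 10_000


def check(event: AccessEvent) -> list[str]:
    """Return alert messages for an event that violates the policy."""
    alerts = []
    allowed = APPROVED_PURPOSES.get(event.dataset, set())
    if event.purpose not in allowed:
        alerts.append(
            f"{event.user}: purpose '{event.purpose}' is not approved "
            f"for dataset '{event.dataset}'"
        )
    if event.rows_returned > MAX_ROWS_PER_QUERY:
        alerts.append(
            f"{event.user}: {event.rows_returned} rows returned from "
            f"'{event.dataset}' exceeds the limit of {MAX_ROWS_PER_QUERY}"
        )
    return alerts


class AuditLog:
    """Keeps every event, compliant or not, so the trail stays complete."""

    def __init__(self) -> None:
        self.events: list[AccessEvent] = []

    def record(self, event: AccessEvent) -> list[str]:
        self.events.append(event)
        return check(event)
```

The design choice worth noting is that every event is stored before it is evaluated, so an access that passes today's policy can still be reviewed later if the policy changes.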

Preparing data for gen AI: Gen AI departs from traditional data management patterns and skills, and it requires new discipline to ensure the quality, accuracy and relevance of the data used to train and augment language models. With vector databases becoming commonplace in the gen AI domain, data governance must be extended to cover these non-traditional data management platforms so that the same governance practices apply to the new architectural components. Data lineage becomes even more important as regulatory bodies require explainability in models.
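One way to picture governance extended to a vector store is to keep lineage and sensitivity metadata alongside each embedded chunk and to enforce it at retrieval time. The sketch below uses an in-memory list and hand-written vectors in place of a real vector database and embedding model; `GovernedChunk`, `SENSITIVITY_RANK`, `retrieve` and `lineage` are hypothetical names, and the three sensitivity levels are an assumed taxonomy.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GovernedChunk:
    text: str
    embedding: tuple[float, ...]
    source_system: str      # lineage: where the chunk came from
    source_record_id: str   # lineage: which record it was derived from
    sensitivity: str        # assumed levels: "public", "internal", "restricted"


SENSITIVITY_RANK = {"public": 0, "internal": 1, "restricted": 2}


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, chunks, clearance: str, top_k: int = 3):
    """Rank only the chunks the caller is cleared to see, most similar first."""
    limit = SENSITIVITY_RANK[clearance]
    allowed = [c for c in chunks if SENSITIVITY_RANK[c.sensitivity] <= limit]
    ranked = sorted(
        allowed,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return ranked[:top_k]


def lineage(chunks) -> list[str]:
    """Return source identifiers so an answer can be traced to its inputs."""
    return [f"{c.source_system}:{c.source_record_id}" for c in chunks]
```

Filtering before ranking means a restricted chunk never reaches the prompt of a caller without clearance, and the lineage list gives a starting point for explaining which records influenced a generated answer.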

Enterprise data is often complex, diverse and scattered across various repositories, making it difficult to integrate into gen AI solutions. This complexity is compounded by the need to ensure regulatory compliance, mitigate risk, and address skill gaps in data integration and retrieval-augmented generation (RAG) patterns. Moreover, data is often an afterthought in the design and deployment of gen AI solutions, leading to inefficiencies and inconsistencies.