Date: 2024-08-21

Large language models can be trained on proprietary data to fulfill specific enterprise use cases. For example, a company could take ChatGPT and create a private model that is trained on the company’s CRM sales data. This model could be deployed as a Slack chatbot to help sales teams find answers to queries like “How many opportunities has product X won in the last year?” or “Update me on product Z’s opportunity with company Y”.

You could easily imagine these LLMs being tuned for any number of customer service, HR or marketing use cases. We might even see these augmenting legal and medical advice, turning LLMs into a first-line diagnostic tool used by healthcare providers. The problem is that these use cases require training LLMs on sensitive proprietary data. This is inherently risky. Some of these risks include:

1. Privacy and re-identification risk

AI models learn from training data, but what if that data is private or sensitive? A considerable amount of data can be used, directly or indirectly, to identify specific individuals. So if we train an LLM on proprietary data about an enterprise’s customers, querying that model could leak sensitive information.
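
One rough way to gauge indirect identification risk before training is a k-anonymity style check: count how many records share each combination of quasi-identifying fields. The sketch below is illustrative only; the field names (`zip`, `age`) and the function name are assumptions, not something taken from a particular tool.

```python
from collections import Counter
from typing import Dict, List, Sequence


def smallest_group_size(rows: List[Dict[str, object]],
                        quasi_identifiers: Sequence[str]) -> int:
    """Return the size of the smallest group of rows that share the same
    values for the given quasi-identifier fields (the 'k' in k-anonymity)."""
    counts = Counter(
        tuple(row[field] for field in quasi_identifiers) for row in rows
    )
    return min(counts.values()) if counts else 0


# Hypothetical customer rows; a group of size 1 means that person is unique
# on these fields and is easier to re-identify.
customers = [
    {"zip": "94107", "age": 34},
    {"zip": "94107", "age": 34},
    {"zip": "10001", "age": 61},
]
k = smallest_group_size(customers, ["zip", "age"])
```

A small `k` suggests that the data should be generalized or filtered before it is used for training.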

2. In-model learning data

Many simple AI models have a training phase followed by a deployment phase in which the model no longer changes. LLMs are a bit different. They take the context of your conversation as input, adapt to it, and respond accordingly.

This makes governing model input data far more complex, because we don’t just have to worry about the initial training data. We also have to worry about every query made to the model. What if we feed the model sensitive information during a conversation? Can we identify the sensitivity and prevent the model from using this information in other contexts?
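
As a minimal sketch of identifying sensitive input at query time, a prompt could be screened and redacted before it reaches the model. The two regular expressions and the `redact_prompt` name below are assumptions for illustration; a real detector would need far broader coverage than this.

```python
import re
from typing import List, Tuple

# Hypothetical patterns covering only emails and US-style SSNs.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_prompt(prompt: str) -> Tuple[str, List[str]]:
    """Replace each match with a placeholder such as [EMAIL] and return
    the redacted prompt plus the list of categories that were found."""
    found = []
    for label, pattern in PATTERNS.items():
        prompt, count = pattern.subn(f"[{label}]", prompt)
        if count:
            found.append(label)
    return prompt, found


clean, categories = redact_prompt(
    "Contact jane.doe@example.com, SSN 123-45-6789"
)
```

The returned category list could also be written to a log, so that there is a record of which queries contained sensitive input.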

3. Security and access risk

To some extent, the sensitivity of the training data determines the sensitivity of the model. Although we have well-established mechanisms for controlling access to data (monitoring who is accessing what data and dynamically masking it based on the situation), AI deployment security is still developing. Solutions are emerging in this space, but we still can’t fully control the sensitivity of model output based on the role of the person using the model (for example, having the model identify that a particular output could be sensitive and then reliably change that output based on who is querying the LLM). Because of this, these models can easily leak any sensitive information involved in model training.
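
For comparison, the data-layer control mentioned above (dynamic masking by role) is easy to express on structured records. This is a sketch under assumed role names and field names; the hard part described in this section is achieving the same guarantee on free-text model output, which this code does not attempt.

```python
from typing import Dict, Set

# Hypothetical policy: the fields each role may see in clear text.
VISIBLE_FIELDS: Dict[str, Set[str]] = {
    "sales_rep": {"company", "stage"},
    "sales_manager": {"company", "stage", "deal_value"},
}


def mask_record(record: Dict[str, str], role: str) -> Dict[str, str]:
    """Return a copy of the record with every field the role is not
    allowed to see replaced by '***'. Unknown roles see nothing."""
    allowed = VISIBLE_FIELDS.get(role, set())
    return {
        field: (value if field in allowed else "***")
        for field, value in record.items()
    }


opportunity = {"company": "Y", "stage": "won", "deal_value": "50000"}
rep_view = mask_record(opportunity, "sales_rep")
manager_view = mask_record(opportunity, "sales_manager")
```

Defaulting unknown roles to an empty set of visible fields keeps the policy closed by default.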

4. Intellectual property risk

What happens when we train a model on every song by Drake and then the model starts generating Drake rip-offs? Is the model infringing on Drake? Can you prove if the model is somehow copying your work?

This problem is still being figured out by regulators, but it could easily become a major issue for any form of generative AI that learns from artistic intellectual property. We expect this will lead to major lawsuits in the future, a risk that will have to be mitigated by carefully tracking the IP status of any data used in training.

5. Consent and DSAR risk

One of the key ideas behind modern data privacy regulation is consent. Customers must consent to the use of their data, and they must be able to request that their data be deleted. This poses a unique problem for AI usage.

If you train an AI model on sensitive customer data, that model becomes a possible exposure source for that data. If a customer were to withdraw consent for the company’s use of their data (a right GDPR grants them) and the company had already trained a model on it, the model would essentially need to be decommissioned and retrained without access to the revoked data.
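
Acting on such a request requires knowing which models were trained on which customers’ data. A minimal sketch of that kind of lineage record follows; the class name, method names, and identifiers are hypothetical, and a production audit trail would also need persistence and timestamps.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class TrainingLineage:
    """In-memory record of which models were trained on which data subjects."""

    def __init__(self) -> None:
        self._models_by_subject: Dict[str, Set[str]] = defaultdict(set)

    def record_training(self, model_id: str, subject_ids: Iterable[str]) -> None:
        """Note that the given model was trained on these subjects' data."""
        for subject_id in subject_ids:
            self._models_by_subject[subject_id].add(model_id)

    def models_affected_by(self, subject_id: str) -> Set[str]:
        """Return the models that would need retraining if this subject
        revoked consent. Returns an empty set for unknown subjects."""
        return set(self._models_by_subject.get(subject_id, set()))


lineage = TrainingLineage()
lineage.record_training("crm-bot-v1", ["cust-1", "cust-2"])
lineage.record_training("crm-bot-v2", ["cust-2"])
affected = lineage.models_affected_by("cust-2")
```

With a record like this, a revocation from one customer maps directly to the list of models that have to be retired or retrained.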

Making LLMs useful as enterprise software requires governing the training data so that companies can trust the safety of the data and have an audit trail for the LLM’s consumption of the data.