Generative AI usage is growing exponentially, and with it the privacy risks compound.
Data is everywhere in the life cycle of generative AI projects: it is the fuel used to create foundation models, more of it may be used to fine-tune them for specific tasks, and still more is fed into them at inference time. These models offer very little in terms of a right to be forgotten: unlearning is still an open problem, and one should assume that training set information may remain in the model forever. When personal data enters this cycle, it becomes a privacy time bomb.
Privacy risk already exists in foundation models, but it is mostly up to the handful of wealthy organizations that build them to deal with it (hello OpenAI, Google, and Meta!).
In this post, we will look into the privacy risks facing any enterprise that deploys generative AI projects.
Privacy risks when prompting models
The most straightforward way to use generative models is to give them a bit of text (a prompt) and let them complete it. The prompt usually includes instructions on how to carry out the task, possibly with examples, plus the input to be analyzed (e.g., a document or a message sent to a chatbot). Prompt design can be so rich that it has spurred an entirely new field: prompt engineering.
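To make this concrete, here is a minimal sketch of a prompt that mixes fixed instructions with untrusted user input; the instructions and the customer message below are invented for illustration.

```python
# Minimal sketch: fixed instructions plus untrusted user input assembled into one prompt.
INSTRUCTIONS = (
    "You are a support assistant. Summarize the customer's message "
    "in two sentences and suggest a next action."
)

# Hypothetical customer message: any personal data it contains goes straight into the prompt.
customer_message = (
    "Hi, I'm Jane Doe (jane.doe@example.com). My card ending in 4242 was "
    "charged twice on August 3rd, please refund me."
)

prompt = f"{INSTRUCTIONS}\n\nCustomer message:\n{customer_message}"
print(prompt)  # this string travels to whichever service hosts the model
```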
As pictured by A16Z, if there is private information in a prompt, it is in for a long voyage! It may be stored in insecure places, reused in future tuning operations, and so on.
In most cases, prompts include text provided by an employee or a client, and that text comes with absolutely no guarantee that it is free of personal data. On top of that, the instructions themselves may be at risk of leaking through what is called prompt injection.
Such prompts become hidden privacy breaches waiting to happen.
Mitigation: Applying data masking techniques to the user input is advisable, but it is not a silver bullet. The best mitigation is to secure the entire generative AI stack by deploying the service on infrastructure the organization controls.
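To illustrate the masking idea, here is a minimal sketch that scrubs obvious identifiers with regular expressions before they reach the model; the patterns are simplistic placeholders, and a real deployment should rely on a dedicated PII-detection tool.

```python
import re

# Illustrative sketch only: these regexes catch obvious patterns (emails, phone numbers,
# card-like digit runs); they are not a substitute for proper PII detection.
PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "[PHONE]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "[CARD]": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def mask_pii(text: str) -> str:
    """Replace recognizable identifiers with placeholders before prompting a model."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(mask_pii("Hi, I'm jane.doe@example.com, call me at +33 6 12 34 56 78."))
# -> Hi, I'm [EMAIL], call me at [PHONE].
```

Masking only catches what the patterns recognize, which is exactly why it is not a silver bullet.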
But the stronger mitigation, going in-house, means starting from open-source models that may not match the best commercial ones out of the box. This calls for fine-tuning them for specific tasks, which opens a new range of privacy risks!
Privacy risks when tuning or adapting models
There is growing evidence that fine-tuned open-source models can match the performance of the best commercial ones at a fraction of the cost. The recent release of Llama-2 under a permissive license triggered a deluge of internal projects mixing it with private data. The opportunity is massive: lower cost, independence from external vendors, stronger security, and potentially better performance on specific tasks.
But one of the most impressive capabilities of generative AI, its ability to condense what it was trained on and reuse it later, is also its biggest privacy liability. When a model is fine-tuned on private data, it may learn the private information by heart and happily regurgitate it when prompted.
Mitigation: We will see in a future post how to mitigate this risk by carefully carrying out the fine-tuning process and calling differential privacy to the rescue. Removing identifiers from the fine-tuning data is usually a good idea too, but on its own it may not provide very strong protection.
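As a teaser of that future post, here is a minimal sketch of differentially private training with the Opacus library on a toy model; the model, data, and hyper-parameters are placeholders chosen only to show the wiring of DP-SGD, not a realistic fine-tuning setup.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine  # assumes the Opacus library is installed

# Toy stand-in for a model head: the point is the DP-SGD wiring, not the architecture.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
data = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=32)

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # more noise -> stronger privacy, lower utility
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

criterion = nn.CrossEntropyLoss()
for epoch in range(3):
    for features, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()

print(f"epsilon spent: {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```

The noise multiplier and clipping bound control the privacy/utility trade-off, and the accountant reports the privacy budget (epsilon) actually spent during training.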
Privacy risks when using embeddings produced by generative models
A very common use case for generative AI is to have the model condense a document into something called an embedding: a vector of numbers that efficiently summarizes the text. Embeddings can then be used to look up the documents that best match a given search. Those documents can be returned to the user or appended to a prompt, which is a smart way of adding private context to it.
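Here is a minimal sketch of that retrieval pattern, assuming the sentence-transformers package and a small open encoder; the documents and the query are invented for illustration.

```python
from sentence_transformers import SentenceTransformer, util  # assumes sentence-transformers is installed

# Sketch of embedding-based retrieval: documents are condensed into vectors,
# the best match for a query is retrieved and pasted into the prompt.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Employment contract for Jane Doe, salary 85,000 EUR per year",
    "Minutes of the Q3 board meeting",
    "Medical leave request from John Smith",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)  # as sensitive as the documents themselves

query = "What is Jane Doe's salary?"
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best_match = documents[int(scores.argmax())]

prompt = f"Answer using this context:\n{best_match}\n\nQuestion: {query}"
```

Note that `doc_embeddings` encodes the content of the documents, names and salaries included.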
This creates new privacy risks: embeddings have to be handled with the same level of security as the underlying data (they are just a powerful summary of it!), but because they look like a mere series of numbers they are easily overlooked. Also, the embedding model itself may be fine-tuned, which carries the same privacy risks as described above.
Mitigation: The best rule of thumb is to treat embeddings as a concentrate of private information. They should be handled accordingly, in compliance with all applicable laws.
* * *
When working with generative AI, it’s crucial to prioritize privacy to prevent severe data breaches, regulatory complications, and a loss of customer trust. We will look into these topics and offer practical solutions in the following posts.
Our next post will delve deeper into the specific privacy risks when fine-tuning large language models and how to address them.
Follow us on LinkedIn to make sure you don’t miss anything on privacy & GenAI!