As Large Language Models (LLMs) move into production, they are increasingly used for tasks like text classification and question answering. LLMs are known to be few-shot learners, which means that you can specialize an LLM’s behavior to a given task with only a handful of examples. Methods to adapt an LLM for specific tasks include in-context learning (ICL), where the model specializes to a task using a few examples provided in the prompt without altering its internal parameters, and fine-tuning, which adjusts the model’s parameters for a specific task.
However, using a prompt-specialized LLM for ICL in production poses data privacy challenges. At inference time, the base LLM is given a prompt that is the concatenation of hidden examples, known only to the model owner, and the user’s query. In this blog post, we will demonstrate that the hidden part of the prompt is vulnerable to membership inference attacks.
Let’s examine a scenario where an organization deploys an LLM for text classification. To avoid divulging the prompt directly, this LLM discloses only class probabilities. Membership inference attacks are a kind of privacy attack that aims to guess whether a particular example was used in the training corpus of an LLM or, in our case, in the prompt examples.
In our experiment, we assume an attacker has only black-box access to a public API and cannot view the model parameters. Drawing from earlier research, our aim is to create a classifier that, based on target label probabilities, can discern between members (user queries present in the hidden part of the prompt) and non-members.
For realism, we used OpenAI’s GPT-3 API and three text classification datasets (dair-ai/emotion, sst2 and poem_sentiment) to construct few-shot classification prompts from arbitrarily chosen samples. The OpenAI davinci model was then asked to predict the log probabilities for both member and non-member examples. The output was limited to a single token and, through logit bias, restricted to the possible class tokens.
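To make the setup concrete, here is a minimal sketch of how such a few-shot prompt could be assembled and how the returned log probabilities could be turned into class probabilities. The prompt template, the function names and the sample data are all assumptions for illustration, since the exact format used in the experiment is not shown here. The sketch does not call any API: the `logprobs` dictionary stands in for whatever the model returns for the single output token.

```python
import math

def build_few_shot_prompt(examples, query, labels):
    """Concatenate the hidden labeled examples with the user's query.

    The template below is a hypothetical one, not the one used in the experiment.
    """
    header = "Classify the text into one of: " + ", ".join(labels) + "."
    shots = [f"Text: {text}\nLabel: {label}" for text, label in examples]
    return "\n\n".join([header, *shots, f"Text: {query}\nLabel:"])

def class_probabilities(logprobs, labels):
    """Renormalize log probabilities over the allowed class labels."""
    # A label missing from the response is treated as having probability 0.
    weights = {
        label: math.exp(logprobs[label]) if label in logprobs else 0.0
        for label in labels
    }
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("no probability mass on any class label")
    return {label: weight / total for label, weight in weights.items()}

# Hypothetical data: two hidden examples and one user query.
labels = ["positive", "negative"]
hidden_examples = [
    ("I loved this film", "positive"),
    ("A dull, lifeless script", "negative"),
]
prompt = build_few_shot_prompt(hidden_examples, "I loved this film", labels)

# Stand-in for the log probabilities returned for the single output token.
fake_logprobs = {"positive": math.log(0.9), "negative": math.log(0.1)}
probs = class_probabilities(fake_logprobs, labels)
```

The attacker only ever sees something like `probs`. The probability assigned to the true label of a query is the signal used in the rest of the post.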
We perform 10 runs of such predictions for each dataset. Below we plot the target label probability distribution of members and of non-members. As the graph shows, the two distributions differ markedly, which gives an attacker room to build a classifier that distinguishes members from non-members much better than random guessing.
To estimate how well the attacker can distinguish members from non-members, we plot the ROC curve of a membership classifier based on the previous probabilities. As we can see, the attacker can very easily distinguish between the two. Given some knowledge of prompt examples, the attacker could build such a classifier. They might also use anomaly detectors or probability tests to decide whether a specific example is part of the prompt.
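As an illustration of how such a ROC curve can be obtained, the sketch below treats the target label probability itself as the membership score and sweeps a decision threshold over it. This is one simple choice of classifier, not necessarily the one used in the experiment, and the scores are made-up values rather than measured results.

```python
def roc_curve_points(member_scores, non_member_scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    An example is predicted to be a member when its score is >= the threshold.
    """
    thresholds = sorted(set(member_scores) | set(non_member_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in thresholds:
        tpr = sum(s >= threshold for s in member_scores) / len(member_scores)
        fpr = sum(s >= threshold for s in non_member_scores) / len(non_member_scores)
        points.append((fpr, tpr))
    return points

def area_under_curve(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Made-up target label probabilities, for illustration only.
member_scores = [0.97, 0.91, 0.88, 0.60]
non_member_scores = [0.72, 0.55, 0.41, 0.35]

points = roc_curve_points(member_scores, non_member_scores)
auc = area_under_curve(points)
```

An area close to 1 means a single threshold on the target label probability already separates members from non-members well, while an area of 0.5 corresponds to random guessing.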
In conclusion, this exploration highlights a significant vulnerability: in typical setups using OpenAI’s API and few-shot prompting, prompt examples can be discerned with ease. As such, it’s critical to implement measures, such as removing Personally Identifiable Information (PII) or using techniques like fine-tuning the model with Differential Privacy (DP), to enforce robust privacy protections.