Annals of personal data

Understanding privacy with rainbows and magic wands!

A colorful illustration of privacy-preserving data transformations

Differential Privacy
Data Governance
Maxime Agostini

There is often confusion about whether the processing of personal data yields personal data or not. Here, we propose a colorful guide that covers most situations.

Rainbow data and magic wands

We will use rainbows for anything that does or may contain personal information. When we are sure there is no personal data in an object, we will use plain white.

Arrows represent data transformations. An arrow is rainbow-colored if the transformation itself encodes personal data. For instance, looking up a patient database and attaching a vaccination status to each row of the input data is definitely rainbow-colored.

Now we can build a very simple logic:

With no personal data in the input or in the transformation, the output is just as safe
For transformations with no specific guarantees, it is safer to assume the output includes personal data
If the transformation itself may carry traces of personal data, the output should be considered personal data until proven otherwise
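The rules above can be sketched as a tiny tainting function, assuming we tag inputs and transformations with a boolean "may contain personal data" flag (the flag names are ours, not from the article):

```python
def output_is_personal(input_is_personal: bool,
                       transform_is_personal: bool) -> bool:
    """Return True if the output must be treated as personal data."""
    # Rule 1: clean input + clean transformation -> clean output.
    # Rules 2 and 3: a rainbow on either side taints the output.
    return input_is_personal or transform_is_personal

output_is_personal(False, False)  # → False: output is safe
output_is_personal(True, False)   # → True: assume personal data
```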

It’s not that rainbow outputs always include personal information, but it is quite likely that traces of the input remain in the output. So unless one proves otherwise, treating it as personal is the most reasonable assumption.

Luckily, some transforms can guarantee that, whatever the input, there will be no personal data in the output. For more on this, check out the Differential Privacy literature. For now, we just need to know that such transforms do exist. We represent them with a magic wand. We can then extend our logic with the following rule:

In general, only transforms that have a specific privacy property can output information that can be deemed anonymous.
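A classic "magic wand" is the Laplace mechanism from differential privacy. The sketch below releases a noisy count; the epsilon value and dataset are illustrative assumptions, not from the article:

```python
import random

def dp_count(records, predicate, epsilon=1.0):
    """Epsilon-DP counting query via the Laplace mechanism.

    A count has sensitivity 1 (one individual changes it by at most
    1), so noise of scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    b = 1.0 / epsilon
    # Laplace(0, b) sampled as the scaled difference of two
    # independent Exp(1) draws.
    noise = b * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_count + noise
```

Whatever the input records are, the released value only reveals a noisy aggregate, which is what lets us paint the output white.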

Let’s use those simple tools to analyze a few common scenarios.

What color is an AI model?

An AI model is the result of a complex transformation of input data during the training phase. By default, it should be considered rainbow-colored: there are many known examples of membership inference attacks on AI models.

When training without privacy guarantees, one should assume the model has retained traces of the original personal data

If the AI model has been trained with a magic transformation (e.g. differentially private libraries like TensorFlow Privacy or Opacus), we have the guarantee that the model does not reveal personal information.

Differential privacy is one way of guaranteeing that the learned model cannot reveal personal data
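The core idea behind DP training in libraries such as TensorFlow Privacy and Opacus is the DP-SGD update: clip each per-example gradient, sum, and add Gaussian noise before stepping. Here is a minimal sketch of one such update; the clip norm, noise multiplier, and learning rate are illustrative assumptions:

```python
import numpy as np

def dp_sgd_update(weights, per_example_grads, lr=0.1,
                  clip_norm=1.0, noise_multiplier=1.1, seed=0):
    """One DP-SGD step: per-example clipping + Gaussian noise."""
    rng = np.random.default_rng(seed)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale each gradient so its L2 norm is at most clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=np.shape(weights))
    return weights - lr * noisy_sum / len(per_example_grads)
```

Because each individual's gradient is bounded and masked by noise, the trained weights provably depend only weakly on any single person's data.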

This framing helps with the question of whether AI models belong to the users whose data trained them. If the AI model was not generated with privacy protection, this claim is very natural. But if the model was trained in a way that guarantees it does not depend on any one individual, it becomes harder to argue. In a sense, the model is “true” irrespective of each individual, just as medical research is “true” beyond the patients that were enrolled in the clinical trial.

What color is personalization?

Personalizing the user experience, whether it is a medical treatment or an invasive ad, is a process that takes personal data as input and outputs a recommendation, typically using historical data from individuals. The process can be an AI model or something much simpler. Either way, the output is going to be personal, so we paint it rainbow! When a department store starts sending coupons for baby items, it clearly reveals what it knows about the recipient.

A recommendation is personal information: however general it may be, it tells us something about the input data that triggered the decision
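A toy example makes the leakage obvious: even a one-rule recommender reveals its input. The product categories and rule below are hypothetical, chosen to echo the department-store anecdote:

```python
def recommend(purchase_history):
    """Output a coupon; the choice itself leaks the input."""
    if "prenatal vitamins" in purchase_history:
        # Anyone seeing this coupon can infer a likely pregnancy.
        return "baby items coupon"
    return "generic coupon"
```

Observing "baby items coupon" in the output is enough to reconstruct a sensitive fact about the input, which is why the arrow and its output are rainbow-colored.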

What color is synthetic data?

Synthetic data is typically generated from an AI model trained on personal data. It can be rainbow-colored or not depending on how the training is done. The final step of generating data consists of feeding random values (clearly not personal) into the model to output synthetic records.

The resulting synthetic data will therefore be the same color as our transform.

If the synthetic data generation process itself may have personal information, we should assume that its outcome also does
If the synthetic data generation process is provably anonymous, we can be confident that the output data is safe

About the author

Maxime Agostini

Cofounder & CEO @ Sarus

©2022 Sarus Technologies.
All rights reserved.