There are two ways to think about anonymization:
- Luck-based anonymization: The process of removing everything that can easily be used to identify someone. For whatever remains, we assume it would take bad luck for someone to use it to extract personal information.
- Math-based anonymization: The process of transforming the information such that it is mathematically impossible to extract significant information about individuals, no matter what one may already know or do.
Most traditional approaches are variations on luck-based anonymization, sometimes unknowingly. We will explain how to recognize luck-based anonymization and why we believe math-based anonymization should prevail.
To anonymize a dataset, intuition tells you to remove whatever may be used to identify someone. You have to answer the question “what are the odds that one would use parts of the data to identify someone?”.
What are the odds one would identify an individual by their name? Very high: delete it. And so on for each field (email address? High again. Home address? High. IP address? High enough.).
Looking at fields in isolation is not sufficient; the same question applies to any combination of fields too. It means answering questions such as “What are the odds that someone knows that there is a pediatrician who is 42 and lives on that street? 🤷”. If the odds are high, this combination must go.
As long as a row of data is unique, someone may know what makes it unique (e.g. age + street + profession) and use this fingerprint to identify the individual. Such data cannot be deemed absolutely anonymous. GDPR is very explicit about this: “data […] which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person” (GDPR, Recital 26). In the absence of an absolute guarantee, the best we can do is hope the odds are low.
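To make this concrete, here is a minimal sketch, using made-up toy records, of how a combination of innocuous fields becomes a fingerprint as soon as it is unique in the dataset:

```python
from collections import Counter

# Hypothetical toy records: no names, yet a row can still be unique.
records = [
    {"age": 42, "street": "Elm St",  "profession": "pediatrician"},
    {"age": 35, "street": "Oak Ave", "profession": "teacher"},
    {"age": 35, "street": "Oak Ave", "profession": "teacher"},
    {"age": 35, "street": "Oak Ave", "profession": "teacher"},
]

# Count how many individuals share each (age, street, profession) combination.
quasi_ids = ("age", "street", "profession")

def fingerprint(record):
    return tuple(record[key] for key in quasi_ids)

sizes = Counter(fingerprint(r) for r in records)

# A combination shared by a single person is a re-identification handle.
unique_rows = [r for r in records if sizes[fingerprint(r)] == 1]
print(len(unique_rows))  # 1: the 42-year-old pediatrician on Elm St
```

Anyone who happens to know those three facts about the pediatrician can single out their row, even though no direct identifier remains.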
Netflix learned this at great cost. Yes, it was bad luck that back in 2008 there was a public database (IMDb in this case) where users had revealed some of the movies they had watched. But this is the problem with luck-based anonymization: it is a lottery, and the odds are not in your favor, because there is more and more public data out there. Think of how much more information is available today than in 2008. Instagram, Facebook, and LinkedIn are just the obvious sources. Far more information is collected daily in corporate data lakes around the world. And when playing the luck-based anonymization lottery, you are also playing against all the information that might become available in the future, whether it comes from voluntary disclosure, data breaches, or new ways of connecting information. It is becoming urgent to stop making assumptions about the additional information that may be available elsewhere.
If releasing unique rows is doomed, one may think that releasing aggregates, where each row combines several individuals, would be the solution. Aggregates do push the odds in your favor, but they remain a lottery. What are the odds that someone knows a slightly different aggregate that differs by one individual? Say, the same aggregate released the day before? Or the same statistic in the next national census?
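This differencing attack is simple enough to sketch in a few lines. The data here is hypothetical, but the mechanism is general: two innocuous aggregates that differ by exactly one individual leak that individual's exact value.

```python
# Hypothetical payroll data: each daily release is an aggregate
# covering several individuals, so it looks safe in isolation.
salaries_monday = {"alice": 3200, "bob": 4100, "carol": 3900}
salaries_tuesday = dict(salaries_monday, dave=5000)  # dave joins overnight

total_monday = sum(salaries_monday.values())
total_tuesday = sum(salaries_tuesday.values())

# Neither total identifies anyone on its own...
# ...but their difference is one person's exact salary.
print(total_tuesday - total_monday)  # 5000: dave's salary
```

No row was ever released, yet combining two aggregates recovered an individual's data exactly.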
Not only does aggregation drastically reduce data utility, but it also falls short of providing a strong anonymization guarantee: it remains a form of luck-based anonymization.
In reaction to luck-based anonymization, researchers have looked for new approaches that don’t depend on what the recipient may know or do. You should not have to answer questions such as “what are the odds that someone knows the age and the date a person is discharged from the hospital?” or “what are the odds that someone knows the name of the only person that has been admitted at the hospital six times in the past year?”. Because quite frankly no one knows.
It took decades before Differential Privacy finally emerged as the best way to provide such guarantees. It was formally introduced by Cynthia Dwork, who won the Gödel Prize for her work. The idea is that when you reveal information about a group of individuals, the recipient should not be able to learn anything significant about any given individual in that group, whatever they may already know. Keeping anything unique about an individual in the output data is a trivial violation of this principle. One can also show that revealing anything without adding some randomness would violate it.
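The canonical way to add that randomness is the Laplace mechanism. Here is a minimal sketch of it for a counting query; the function name and parameters are illustrative, but the calibration is the standard one: a count changes by at most 1 when one individual is added or removed (sensitivity 1), so noise of scale 1/ε makes the released count ε-differentially private.

```python
import random

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with Laplace noise of scale 1/epsilon.

    With sensitivity 1 (one individual shifts a count by at most 1),
    this calibration satisfies epsilon-differential privacy.
    """
    # A Laplace(1/epsilon) sample is the difference of two
    # independent exponentials of mean 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# The same query asked twice gives different answers; this randomness
# is precisely what defeats the differencing attack on aggregates.
print(dp_count(1000, epsilon=0.5))
```

The smaller the ε, the stronger the guarantee and the noisier the answer; choosing ε is a utility/privacy trade-off, not a way to escape it.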
Looking at each row separately and removing what can lead to identification no longer works. It calls for a new way of operating on sensitive datasets: instead of releasing datasets that are anonymous “if you are lucky”, it makes more sense to offer an API to compute on the original datasets, with math-based anonymity baked into every interaction.
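As an illustration, such an interface might look like the following sketch. The class, method names, and the deliberately simplistic budget accounting are all hypothetical, not an actual product API; the point is that the raw data never leaves the interface, every answer is noised, and every query spends a finite privacy budget.

```python
import random

class PrivateDataset:
    """Illustrative sketch of a query interface with differential
    privacy applied to every interaction (not a real API)."""

    def __init__(self, rows, epsilon_budget=1.0):
        self._rows = list(rows)        # original data stays server-side
        self._budget = epsilon_budget  # total privacy loss allowed

    def count(self, predicate, epsilon=0.25):
        # Every interaction spends part of the privacy budget...
        if epsilon > self._budget:
            raise RuntimeError("privacy budget exhausted")
        self._budget -= epsilon
        # ...and every answer is noised before it is returned
        # (Laplace noise of scale 1/epsilon, sensitivity 1).
        true_count = sum(1 for row in self._rows if predicate(row))
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

# Analysts compute on the real data; only noisy answers ever leave.
ds = PrivateDataset({"age": a} for a in [25, 31, 44, 52, 67])
print(ds.count(lambda row: row["age"] > 40, epsilon=0.25))
```

Once the budget is spent, the interface refuses further queries: the total information leaked about any individual stays bounded no matter how many questions were asked.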
The good news with this approach is that not only does it provide provable guarantees, but it also allows you to keep the original data in full. The exact birth date may be useful for diagnosing newborns. The full address, including floor level and every detail, may matter for delivery optimization. Similarly, having the most precise GPS traces will make a difference when designing a city mobility master plan.
If you are crafting data by hand, if you know precisely and trust who will use it, and if you know what they will do with it, luck-based approaches can be deemed sufficient. The data would not be strictly anonymous under GDPR but it may be sufficiently protected. But if you want to enter the industrial era of data, scale up data workflows, and meet modern — and future! — compliance standards, you definitely want to rely on a more robust notion of anonymization.
Sarus was started to make the math-based approach 10x more practical and powerful than the luck-based approach. We believe that it should be no one’s responsibility to answer every possible “what are the odds” question that stands in the way of using personal data for research and innovation.