Differential Privacy

One of the biggest challenges corporations face in unlocking the full value of their data assets is addressing the regulatory and security risks related to private information. Recent years have seen the advent of Privacy Enhancing Technologies (PETs), which aim to reconcile the vast data needs of analytics and AI projects with privacy protection. Differential Privacy, one of these PETs, was invented in 2006 and has emerged as the best candidate for defining, on scientific grounds, what "anonymous" may mean. Unlike most PETs, which focus on carrying out a data processing task without revealing data during processing ("input privacy"), Differential Privacy protects sensitive data by limiting the personal information revealed by the output of a computation, such as specific statistics or a trained machine learning model ("output privacy"). It therefore makes it possible to share insights about a group of people without putting at risk the personal information of any single individual present in the collected data.

Differential Privacy is based on the addition of statistical noise to computation results. The noise introduces a level of uncertainty that limits how much information about any one individual may be revealed. The noise must be large enough to hide the effect of a single individual, but not so large that the result becomes inaccurate.
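To make this concrete, here is a minimal sketch of the Laplace mechanism in Python with NumPy. The salary figures, bounds, and epsilon value are made up for illustration; production implementations also handle subtleties (such as floating-point attacks) that this sketch ignores.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism (illustrative sketch)."""
    n = len(values)
    # Clamp every value to [lower, upper] so one person's influence is bounded.
    clipped = np.clip(values, lower, upper)
    # Sensitivity of the mean: replacing one record moves it by at most this much.
    sensitivity = (upper - lower) / n
    # Noise scale grows with sensitivity and shrinks as epsilon (the privacy budget) grows.
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

salaries = np.array([42_000, 55_000, 61_000, 48_000, 70_000])
print(dp_mean(salaries, lower=20_000, upper=120_000, epsilon=1.0))
```

The smaller the epsilon, the stronger the privacy guarantee and the noisier the result.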

Metaphorically, you can picture the beautiful movement of a flock of birds: a murmuration. If you take out or add one bird, you will not even notice a difference in the overall movement. Differential Privacy is an ideal tool for studying the movements of groups without revealing any individual information!
Unlike legacy data protection methods such as data masking or pseudonymization, Differential Privacy irreversibly prevents re-identification, no matter what additional information one may possess, today or tomorrow. Imagine a statistic, the average salary of employees, is published every year. If you know which single person left the company this year, you can calculate their salary from the two published averages. This simple example illustrates how published statistics can lead to the reconstruction of a substantial part of the personal information on which they were calculated. To learn more, we recommend this paper about a successful reconstruction attack on the 2010 US Census: The 2010 Census Confidentiality Protections Failed, Here's How and Why. Differential Privacy rules out such attacks, which makes it a recognized gold standard in privacy protection.
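To see the attack in numbers, suppose 100 employees have a published average salary of 50,000, and after one departure the 99 remaining employees average 49,800 (figures made up for illustration):

```python
# Exact averages published in two consecutive years (made-up numbers).
n = 100
avg_before = 50_000   # average over 100 employees
avg_after = 49_800    # average over 99 employees after one person left

# Exact totals can be recovered from exact averages.
total_before = avg_before * n
total_after = avg_after * (n - 1)

# The difference is precisely the salary of the person who left.
print(total_before - total_after)  # 69_800
```

With differentially private averages, the recovered value would be dominated by noise instead of revealing the departed employee's salary.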

This page is a list of useful resources about Differential Privacy.

If you have more questions about Differential Privacy and how it is implemented in Sarus, contact us!

First reads on Differential Privacy

Various public agencies and organizations have published guidelines on PETs. If you are looking for general information as a first step in discovering Differential Privacy and other PETs, we recommend:

Differential Privacy and Machine Learning

At its core, DP is about adding noise to a statistical computation. In the case of machine learning, the non-deterministic element can be implemented in different ways, such as adding noise to the parameters of the model during training, randomizing between various models trained on different user bases, or training models on synthetic datasets that have themselves been produced with Differential Privacy.
The most widely used approach in deep learning is DP-SGD ([1607.00133] Deep Learning with Differential Privacy). In this approach, per-example gradients are clipped and random noise is added to them during stochastic gradient descent, as sketched below. This ensures that updates to the model parameters are not precisely determined by any individual training data point, which prevents an attacker from reverse-engineering the model and extracting sensitive information.
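Here is a minimal, illustrative NumPy sketch of the DP-SGD recipe (per-example gradient clipping followed by Gaussian noise) for logistic regression. All names and hyperparameters are hypothetical, and the privacy accountant that maps the noise multiplier and the number of steps to a concrete (epsilon, delta) guarantee is omitted; for real training, prefer a vetted library such as Opacus or TensorFlow Privacy (see below).

```python
import numpy as np

def dp_sgd_step(w, X_batch, y_batch, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD step for logistic regression (sketch, not a vetted implementation)."""
    # Per-example gradients of the logistic loss: (sigmoid(x.w) - y) * x.
    preds = 1.0 / (1.0 + np.exp(-X_batch @ w))
    per_example_grads = (preds - y_batch)[:, None] * X_batch  # shape (batch, dim)

    # 1. Clip each example's gradient so no individual has unbounded influence.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)

    # 2. Sum the clipped gradients and add Gaussian noise calibrated to the clip bound.
    noisy_sum = clipped.sum(axis=0) + np.random.normal(
        0.0, noise_multiplier * clip_norm, size=w.shape
    )

    # 3. Average over the batch and take an ordinary gradient step.
    return w - lr * noisy_sum / len(X_batch)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
y = (X @ np.ones(5) > 0).astype(float)
w = np.zeros(5)
for _ in range(100):
    w = dp_sgd_step(w, X, y)
```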
For a deeper dive into differentially private machine learning, here are some useful resources:
Differentially private machine learning is also a natural framework for creating privacy-safe synthetic data. Injecting statistical noise during training lets a generative model capture the statistical properties and patterns of the dataset while ensuring that no private information about individual data points is memorized. There is a large body of work on privacy-preserving synthetic data generation (beyond the huge output on generative AI, which is at the foundation of synthetic data generation). Here are a few remarkable ones:

Real-world implementations of Differential Privacy

Differential Privacy has been around for almost 20 years and now has many real-world applications, from consumer electronics to national statistics. Major companies and public agencies are implementing Differential Privacy to ensure data protection. Today, differentially private mechanisms are used by the US Census Bureau, the Wikimedia Foundation, and companies such as Microsoft, Apple, and LinkedIn.
Here are some details about such implementations:
  • The US Census Bureau adopted differential privacy for the 2020 Census to protect respondents' privacy while still providing accurate population statistics.
  • Apple has been a pioneer in implementing differential privacy across its products and services. In iOS, macOS, and other platforms, Apple uses differential privacy techniques to collect usage data from its users, which helps improve certain features without compromising individual privacy.
  • Microsoft has integrated differential privacy into various products and services, including its cloud platform Azure and productivity tools like Microsoft Office. By applying differential privacy techniques, Microsoft can collect telemetry data and usage statistics from its users, which helps it improve product performance, identify and fix issues, and enhance the user experience, all without compromising privacy.
  • LinkedIn implements differential privacy to manage and protect user data while providing insights for audience engagement and analytics.
  • The Wikimedia Foundation's Privacy Engineering Team is actively exploring the application of differential privacy across various domains within the Wikimedia ecosystem. This includes documentation of project statuses, design decisions, and potential future directions for differential privacy at WMF.

Differential Privacy and Privacy Regulations

While privacy regulations such as GDPR in Europe and CCPA in California emphasize the importance of protecting personal data, they do not explicitly require the use of any specific technology.
However, privacy regulators and authorities around the world are showing growing interest in Privacy Enhancing Technologies, including Differential Privacy, recognizing their potential to provide strong privacy guarantees while enabling valuable data analysis and research. Such technologies make it possible to share only anonymous results, such as statistics or trained ML models, effectively mitigating the risk of re-identification, or to secure access to personal data for analytics purposes. Hence, a number of official frameworks prescribing the use of Differential Privacy have emerged:
  • The French CNIL published AI how-to sheets, which emphasize the importance of designing AI systems that ensure data protection by design, including by using Differential Privacy during the ML learning phase.
  • US President Joe Biden issued an Executive Order “on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence”, followed by NIST’s “Guidelines for Evaluating Differential Privacy Guarantees”, which qualify Differential Privacy as “currently the best known method for providing robust privacy protection against known and future attacks, even in the face of multiple data releases”.
  • The UK’s ICO published guidance on Privacy-Enhancing Technologies, including DP, providing general information about the technology and outlining the associated risks.

Open source Differential Privacy libraries

Implementing differential privacy requires a deep understanding of both privacy principles and practical implementation techniques. It can therefore be a challenging task, especially for someone who is not a privacy expert. Unless you have a dedicated team of privacy specialists, it is recommended to use renowned libraries:
  • Google's Differential Privacy Library: Developed by Google, this library offers a robust set of tools for implementing differential privacy in data analytics and aggregation. It provides a comprehensive suite of algorithms that are optimized for performance and usability.
  • IBM's Diffprivlib: The Differential Privacy Library by IBM is a general-purpose library for experimenting with, investigating, and developing differential privacy applications in Python. It covers a wide range of differential privacy techniques, including the Laplace mechanism, the Gaussian mechanism, and differentially private stochastic gradient descent (DP-SGD), and is designed to be easy to use and flexible. A short usage example follows this list.
  • Opacus: Originally developed by Facebook, Opacus is a PyTorch library that makes it easier to train machine learning models with differential privacy.
  • OpenDP: an open-source library developed by the Harvard University Institute for Quantitative Social Science (IQSS) that provides tools for implementing differential privacy in data analysis workflows. It aims to make it easier for researchers, data scientists, and developers to incorporate differential privacy techniques into their projects, and offers a range of functionalities for analyzing data while preserving privacy, including mechanisms for adding noise to data, estimating privacy risks, and quantifying privacy guarantees.
  • OpenMined’s PySyft: PySyft is a Python library for secure and private deep learning. It extends PyTorch and other deep learning tools with capabilities for federated learning, secure multiparty computation, and differential privacy to enable privacy-preserving data science operations.
  • Qrlew: an open-source library developed by Sarus Technologies that turns SQL queries into their differentially private (DP) equivalents. It takes SQL – the universal language of small and big data analytics – as input, so there is no new language or API to learn, and returns DP SQL queries that can be executed at scale on any SQL datastore.
  • TensorFlow Privacy: TensorFlow Privacy is an extension of the TensorFlow machine learning framework that provides tools for training machine learning models with differential privacy. It includes functions for adding differentially private noise to gradients during model training and supports various differential privacy mechanisms, such as the Gaussian mechanism and the Laplace mechanism.
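As a taste of how lightweight these libraries are in practice, here is a short example computing a differentially private mean with IBM's diffprivlib; the epsilon and bounds arguments follow the library's documented API, and the data is made up:

```python
import numpy as np
from diffprivlib.tools import mean

ages = np.random.randint(18, 90, size=1_000)

# Bounds must be chosen independently of the data; deriving them from the
# data itself would leak private information.
dp_avg = mean(ages, epsilon=1.0, bounds=(18, 90))
print(dp_avg)
```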
