The protection of grouping keys in the context of differentially private SQL

Differential Privacy
Data Science
SQL
Victoria de Sainte Agathe

Differential privacy (hereafter DP) can be applied to any statistical analysis. Dwork and Roth (2014) explain why histograms that count individuals are sensitive, and how adding noise prevents someone from inferring the presence of an individual in a dataset from the result of a count. In this post, we show how the grouping keys, i.e. the values of the GROUP BY columns of a SQL query, may also create privacy breaches.

For our demonstration, we use a hospital admission dataset that records the occupation of each admitted patient.

The dataset can be accessed through SQL queries; for simplicity, only histogram queries are allowed. Hoping (incorrectly) to obtain DP guarantees, the data owner has set up an interface that adds Laplace random noise to each count.

We adopt the point of view of a malicious person wanting to infer the presence of the French president in the hospital records.

We send the following query:

SELECT occupation, COUNT(*) + Laplace(1/ε) AS count_all 
FROM admissions 
GROUP BY occupation;
  • If the French president is in the dataset, the key president of the French Republic is revealed.
  • If the French president is not in the dataset, the key president of the French Republic is not revealed.

By sending this query, we are 100% sure of the presence or absence of the French president in the hospital records.
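To make the breach concrete, here is a minimal Python sketch of the naive noisy-histogram release and the attacker's inference. The data and the function name `release_histogram` are hypothetical; only the structure of the release matters.

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)

def release_histogram(occupations, epsilon):
    """Naive release: one Laplace-noised count per key actually present in the data."""
    return {occ: count + rng.laplace(scale=1.0 / epsilon)
            for occ, count in Counter(occupations).items()}

with_president = ["nurse", "teacher", "president of the French Republic"]
without_president = ["nurse", "teacher"]

# The noise on the counts does not matter: the attacker only looks at the keys.
assert "president of the French Republic" in release_histogram(with_president, epsilon=1.0)
assert "president of the French Republic" not in release_histogram(without_president, epsilon=1.0)
```

The set of released keys is a deterministic function of the data, so no amount of noise on the counts can hide membership.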

This issue may be solved by two different techniques:

  • By defining a public list of occupations. For each public item, a noisy count is revealed. If the public item is not in the dataset then the noisy count is just made of noise.
  • By applying the tau-thresholding mechanism before releasing the result.

Releasing public grouping keys

In some cases, the grouping keys may be replaced by public values. For example, in the case of a hospital admission dataset, we may replace the ‘age’ column by the integers between 18 and 123.

In our case, the data owner may choose to replace the occupation column by a public list of occupations.

This is equivalent to sending the following query:

SELECT occupations_list.occupation, COUNT(*) + Laplace(1/ε) AS count_all 
FROM admissions RIGHT JOIN occupations_list
ON admissions.occupation = occupations_list.occupation 
GROUP BY occupations_list.occupation;

The result contains the key president of the French Republic if and only if this key is in the public list, regardless of the presence of the French president in the dataset.
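Under the same hypothetical setup, the public-key fix can be sketched as follows: every key in the public list gets a noisy count, whether or not it appears in the data (the function name `release_public_keys` is illustrative).

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)

def release_public_keys(occupations, public_list, epsilon):
    """One noisy count per public key; keys absent from the data yield pure noise."""
    counts = Counter(occupations)
    return {occ: counts.get(occ, 0) + rng.laplace(scale=1.0 / epsilon)
            for occ in public_list}

public_list = ["nurse", "teacher", "president of the French Republic"]
result = release_public_keys(["nurse", "nurse", "teacher"], public_list, epsilon=1.0)

# The released keys depend only on the public list, never on the data.
assert set(result) == set(public_list)
```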

The tau-thresholding mechanism

When it is not possible to get public grouping keys, we may use the tau-thresholding mechanism introduced in Korolova et al. (2009).

It consists of revealing a grouping key only if the corresponding noisy count of distinct users exceeds some threshold 𝜏.

This is equivalent to sending the following query:

SELECT occupation, COUNT(*) + Laplace(2/ε) AS count_all 
FROM admissions 
GROUP BY occupation
HAVING COUNT(DISTINCT name) + Laplace(2/ε) > 𝜏;

Following Wilson et al. (2019), the privacy parameter ε is split equally between the two counts.

The threshold value 𝜏 is chosen so that the probability of releasing a category containing a single individual is smaller than the privacy parameter δ (typically much smaller than 1/N, where N is the number of individuals in the dataset).
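As a sketch (hypothetical helper names and data): a key with a single distinct user is released when 1 + Laplace(2/ε) > 𝜏, which happens with probability ½·exp(−(𝜏−1)·ε/2), so setting this equal to δ gives 𝜏 = 1 + (2/ε)·ln(1/(2δ)).

```python
from collections import defaultdict
import math
import numpy as np

rng = np.random.default_rng(0)

def tau(epsilon, delta):
    """Smallest threshold such that P(1 + Laplace(2/eps) > tau) <= delta,
    using the Laplace tail P(X > t) = exp(-t * eps / 2) / 2 for t >= 0."""
    return 1.0 + (2.0 / epsilon) * math.log(1.0 / (2.0 * delta))

def release_tau_thresholded(rows, epsilon, delta):
    """rows: (name, occupation) pairs; half of epsilon is spent on each count."""
    threshold = tau(epsilon, delta)
    by_occupation = defaultdict(list)
    for name, occupation in rows:
        by_occupation[occupation].append(name)
    released = {}
    for occupation, names in by_occupation.items():
        # Threshold on the noisy count of *distinct* users...
        if len(set(names)) + rng.laplace(scale=2.0 / epsilon) > threshold:
            # ...then release the noisy count of rows.
            released[occupation] = len(names) + rng.laplace(scale=2.0 / epsilon)
    return released

rows = [(f"nurse_{i}", "nurse") for i in range(1000)]
rows.append(("E. Macron", "president of the French Republic"))
result = release_tau_thresholded(rows, epsilon=1.0, delta=1e-5)
# With overwhelming probability, the single-user key is suppressed
# while the 1000-user key survives the threshold (about 22.6 here).
```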

This method has three major drawbacks:

  • The probability of releasing a sensitive grouping key is very small but not zero.
  • The δ privacy parameter does not compose well: while the ε parameter grows with the square root of the number of compositions, the δ parameter grows linearly (Theorem 3.20 of Dwork and Roth, 2014).
  • Since the privacy budget ε is split equally between the two counts, reaching a specified utility on the COUNT(*) result costs twice the budget of the first method.

In the Sarus SQL engine, we use both approaches. For each column in a dataset, the data owner may specify that it can be replaced by a public list. If it cannot, we apply the tau-thresholding mechanism before revealing the grouping keys.

Bibliography

Dwork, C., and Roth, A. (2014). The Algorithmic Foundations of Differential Privacy.

Korolova, A., Kenthapadi, K., Mishra, N., and Ntoulas, A. (2009). Releasing Search Queries and Clicks Privately.

Wilson, R. J., et al. (2019). Differentially Private SQL with Bounded User Contribution.

About the author

Victoria de Sainte Agathe

Data Scientist R&D @ Sarus
