In this post, we will show how to use Sarus to train an XGBoost model with differential privacy.
More specifically, we will use a dataset from Kaggle where the goal is to predict whether patients should be treated as in-care or out-care patients based on some of their vitals. The dataset we use can be found here.
In practice, such data is not easily available for training machine learning models, since it consists of personal health data, which is strictly regulated. A guarantee that the privacy of patients is protected during both the training and inference phases is therefore needed. The gold standard for making sure that no personal data remains in the model is a mathematical theory called differential privacy:
Differential privacy (DP) is a framework that bounds the ability to identify any individual in a dataset by carefully randomizing model training. This bound takes the form of a privacy budget ε that can be tuned to calibrate the privacy/utility tradeoff. Using Sarus’s DP version of XGBoost, a model can be trained on any data with guaranteed anonymization.
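To build intuition for what ε controls, here is a minimal, self-contained sketch of the classic Laplace mechanism applied to a counting query. This illustrates the general DP principle only, not the internal mechanism of Sarus or DP-XGBoost, and the count is an invented value:

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, rng: np.random.Generator) -> float:
    # Laplace mechanism: a counting query has sensitivity 1, so adding
    # noise drawn from Laplace(scale = 1 / epsilon) makes it epsilon-DP.
    return true_count + rng.laplace(scale=1.0 / epsilon)

rng = np.random.default_rng(0)
true_count = 412  # e.g. number of in-care patients (illustrative value)

# Smaller epsilon -> more noise -> stronger privacy, lower utility.
for eps in (0.1, 1.0, 10.0):
    print(eps, dp_count(true_count, eps, rng))
```

The smaller the budget ε, the wider the noise distribution, which is exactly the privacy/utility tradeoff mentioned above.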
Let’s look at the process in detail.
Step 1: The data owner opens secure access to the data
Prior to data analysis, the data owner (the hospital, for instance) logs onto Sarus and lists the dataset of interest in the interface. This triggers the generation of differentially private synthetic data that will come in handy later on. The data owner can then define who will be able to query this dataset through the differentially private API provided by the software. The original data never moves and is never altered.
Step 2: Preparing the data remotely
As a data scientist who has been granted access to this dataset via the API, we can now start working. The first thing we would like to do is to see the data. Of course, the differentially private API is designed specifically to prevent that from happening. Luckily, we can explore high-fidelity synthetic data instead. Here is how it goes.
This synthetic data is very useful to understand the data and perform feature engineering or debugging our code. Of course, we don’t want to train our model on synthetic data but on the real data. We’ll see how Sarus achieves that in a minute.
The last column, SOURCE, is the label, and the rest are our features. There is a mix of floats, integers and categorical variables.
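The synthetic sample was displayed inline in the original post. As a stand-in, here is an illustrative pandas sketch of what such a sample looks like; the column names are assumed from the public Kaggle dataset and every value is invented, so treat this purely as a schema illustration:

```python
import pandas as pd

# Hand-written rows mimicking the synthetic sample's schema
# (assumed column names; invented values).
df = pd.DataFrame({
    "HAEMATOCRIT": [35.1, 43.5, 33.5],    # float vital
    "LEUCOCYTE":   [11.7, 5.0, 10.3],     # float vital
    "THROMBOCYTE": [310, 248, 199],       # integer vital
    "AGE":         [1, 40, 32],           # integer
    "SEX":         ["F", "F", "M"],       # categorical
    "SOURCE":      ["out", "out", "in"],  # label: in-care vs out-care
})
print(df.dtypes)
```

Inspecting `df.dtypes` like this is how we would confirm the mix of floats, integers and categorical columns before engineering features.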
Step 3: Training the model with DP
To train the model, the procedure is the same as for the standard XGBoost classifier:
That’s it!
For advanced users who master differential privacy concepts, the clf.fit() method also accepts a target_epsilon parameter, which determines how much privacy budget to spend during training. If it is not set, training is performed with the highest ε possible given the global budget set by the data owner.
The other parameters are the same as in vanilla XGBoost. However, in DP-XGBoost their effect differs somewhat, as they also impact the randomization. By default, Sarus handles the hyperparameter tuning.
Congrats! You just trained your first XGBoost model with differential privacy.
As we can see, this can be done seamlessly without fundamentally changing the way data scientists work. The whole process was done without accessing the original data at any point, while letting the user perform all the data exploration needed.
As the original data is never accessed, the compliance process is much lighter and faster, allowing the data scientist to start working on the project from day one. All the more so as they do not depend on data engineering resources, since no bespoke anonymization pipeline is required. Moreover, with this kind of safe access, it is even easier to ask for additional features, opening up the opportunity to drastically increase the model’s performance.
But how does differential privacy impact the model’s performance? Let’s import our train and test sets to compare our results with and without it:
As we can see, for ε > 3 we get performance comparable to the non-private model. Bear in mind that these results are obtained on a relatively small dataset of about two thousand records. With a dataset ten times larger and an ε of 1, we would get the same performance that we have here with an ε of 10.
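A back-of-the-envelope way to see why ten times more data can compensate for a ten-times-smaller ε: for simple additive mechanisms, the noise added to an average scales like 1/(n·ε). This is only an intuition sketch, not the exact privacy accounting that DP-XGBoost performs:

```python
def dp_mean_noise_scale(n: int, epsilon: float, value_range: float = 1.0) -> float:
    # Laplace noise scale for a bounded mean query: the sensitivity is
    # value_range / n, so the noise scale is value_range / (n * epsilon).
    return value_range / (n * epsilon)

# 2,000 records at eps = 10 vs. 20,000 records at eps = 1:
small_dataset = dp_mean_noise_scale(n=2_000, epsilon=10.0)
large_dataset = dp_mean_noise_scale(n=20_000, epsilon=1.0)
# Both settings yield the same noise level, matching the intuition above.
```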
In this tutorial, we saw how to train an XGBoost model with Sarus while fully preserving privacy.
Traditional compliance pipelines would have required altering the data: removing some columns, binning others… In comparison, Sarus lets the data practitioner leverage the original data in full fidelity without risk of privacy leakage. There are plenty of benefits to this approach:
- Time: the data governance and anonymization steps are non-existent,
- Data: datasets that would not be available can be leveraged,
- Features: some features that are highly identifying can be included without introducing privacy risk,
- Security: unlike traditional approaches, we have a mathematical guarantee that no personal data can be recovered from the training process or the model weights.
Alongside DP-XGBoost, Sarus supports a wide range of data processing tools, from SQL queries to deep learning. They can be used directly in Python with the same code we would write if we had the data on our own machine. Data teams can explore data that was previously locked behind a regulatory wall and train models with their favorite libraries, all with privacy guarantees.
For more information, please come and say hi!