Sarus just released DP-XGBoost

Differentially Private Boosted Trees at scale

Joan Gonzalvez

XGBoost is one of the most popular gradient boosted trees libraries and is featured in many winning solutions in Kaggle competitions. It’s written in C++ and usable from many languages: Python, R, Java, Julia, or Scala. It can run on major distributed environments (Kubernetes, Apache Spark, or Dask) to handle datasets with billions of examples.

XGBoost is often used to train models on sensitive data. Since it comes with no privacy guarantee, one can show that personal information may remain in the model weights. Differential Privacy (DP) is the right framework to address this privacy risk. We are happy to announce the first open-source differentially private fork of XGBoost. The project can be found on GitHub.

How it works: DP boosted trees

In a 2019 paper, Li et al. developed a method for DP boosted trees. Sarus DP-XGBoost builds upon their ideas with some improvements.

The key improvements are the following:

  • We use the approximate learning method of XGBoost, which is able to handle huge datasets, as opposed to the small-scale basic exact method.
  • We reduce the noise added by leveraging the min_child_weight parameter in XGBoost. This parameter is normally used to reduce overfitting, and it also allows for a dramatic reduction in privacy loss.
  • We leverage the subsample parameter to benefit from the privacy amplification by subsampling of Balle et al. (2018).
  • We take advantage of the XGBoost design to run in distributed mode on a variety of platforms such as Kubernetes, Dask or Apache Spark.

Figure: Design of a single DP tree. The privacy budget is split into 3 parts for the different mechanisms. XGBoost quantile sketches are fed with values drawn from a DP histogram. The best splits are selected with an exponential mechanism as in Li et al. (2019), and a Laplace mechanism is used to compute leaf values.

The method is described in detail in a technical report.
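
To make the two selection and release mechanisms named in the tree design more concrete, here is a minimal sketch in Python with NumPy. It is not the DP-XGBoost implementation: the function names, the even three-way budget split, the sensitivities and the candidate gains are all assumptions chosen for illustration, and the sketch ignores how the split budget is further divided across tree levels.

```python
import numpy as np


def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Sample an index with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    scores = np.asarray(scores, dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()  # shift for numerical stability, probabilities unchanged
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(scores), p=probs))


def laplace_mechanism(value, epsilon, sensitivity, rng):
    """Release value plus Laplace noise of scale sensitivity / epsilon."""
    return float(value + rng.laplace(loc=0.0, scale=sensitivity / epsilon))


rng = np.random.default_rng(0)

# Assumption: the per-tree budget is split evenly between the DP histogram,
# the split selection and the leaf values (the real split may differ).
eps_tree = 1.0
eps_hist, eps_split, eps_leaf = (eps_tree / 3.0,) * 3

# Hypothetical gains of three candidate splits and a hypothetical raw leaf value.
candidate_gains = [0.2, 1.5, 0.9]
chosen_split = exponential_mechanism(candidate_gains, eps_split, sensitivity=1.0, rng=rng)
noisy_leaf = laplace_mechanism(0.42, eps_leaf, sensitivity=0.1, rng=rng)
```

The exponential mechanism favours high-gain splits without ever releasing the gains themselves, while the Laplace mechanism perturbs a numeric value directly; a larger 𝜀 makes both closer to their non-private counterparts.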

Installing Sarus DP-XGBoost

To use the Python library, you can simply install the package with pip.

pip install dp-xgboost

If package wheels are not available for your platform, you’ll need to have CMake installed so that the package can be built from source.

That’s it, now you can use XGBoost with our differentially-private tree learning algorithm in Python!

Example Usage

We have included regression and classification Python examples in the repo. For now, Sarus DP-XGBoost is available for Python and Spark (Scala). For differential privacy, the lower and upper bounds of each feature of the input matrix must be known (these bounds should be public or estimated with differential privacy themselves). The labels should also lie in [-1, 1].

Apart from this preprocessing, the usage is very similar to the usual XGBoost. For now, DP-XGBoost supports regression and binary classification, and we plan on adding more objective functions in the future. Here’s a code snippet that shows DP-XGBoost in action.
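
The preprocessing described above can be sketched as follows, assuming the bounds are already public. The helper names clip_features and scale_labels are hypothetical and not part of DP-XGBoost; the label rescaling is one possible choice for regression targets with a known range.

```python
import numpy as np


def clip_features(X, feature_min, feature_max):
    """Clip each column of X to its (public) lower and upper bound."""
    return np.clip(np.asarray(X, dtype=float), feature_min, feature_max)


def scale_labels(y, y_min, y_max):
    """Clip labels to [y_min, y_max], then map them linearly to [-1, 1]."""
    y = np.clip(np.asarray(y, dtype=float), y_min, y_max)
    return 2.0 * (y - y_min) / (y_max - y_min) - 1.0


# Hypothetical data: two features with public bounds [0, 1] and [0, 10].
feature_min = [0.0, 0.0]
feature_max = [1.0, 10.0]
X_raw = [[-5.0, 0.5], [2.0, 30.0]]
X_clipped = clip_features(X_raw, feature_min, feature_max)

# Hypothetical regression targets with a public range of [0, 100].
y_scaled = scale_labels([0.0, 50.0, 100.0, 150.0], y_min=0.0, y_max=100.0)
```

Clipping guarantees that every value really lies within the declared bounds, which is what the sensitivity of the DP mechanisms relies on.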

# xgb refers to the DP-XGBoost Python module; trainX, trainY, dtest, obj,
# dp_per_tree and n_trees are assumed to be defined beforehand

# the DMatrix is constructed with feature bounds for DP
# feature_min/feature_max are the lists of min/max values for each feature of the training dataset
dtrain = xgb.DMatrix(trainX, label=trainY,
                     feature_min=feature_min, feature_max=feature_max)

# XGBoost params
paramsDP = {
    'objective': obj,
    'tree_method': 'approxDP',  # this is the Sarus DP-XGBoost tree updater
    'dp_epsilon_per_tree': dp_per_tree,  # privacy budget per tree
    'max_depth': 6,
    'learning_rate': 0.3,
    'lambda': 0.1,
    'base_score': 0.5,
    'subsample': 0.1,
    'min_child_weight': 200,
    'nthread': 4,
}

bstDP = xgb.train(paramsDP, dtrain, num_boost_round=n_trees)

# we can now make and publish differentially private predictions
predictionsDP = bstDP.predict(dtest)

# note that publishing the whole model is not DP!
# only the split conditions and leaf values (and thus the predictions) are

DP-XGBoost parameters and result

The most important parameters for differential privacy are the privacy budget per tree 𝜀 (dp_epsilon_per_tree) and the number of trees (num_boost_round). The subsample and min_child_weight parameters can be used to improve out-of-sample accuracy and boost privacy.
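
As a rough illustration of why a small subsample value helps privacy, the sketch below evaluates the standard amplification-by-subsampling bound 𝜀' = log(1 + q(e^𝜀 − 1)) for a mechanism run on a random fraction q of the data. The function name is hypothetical, and whether DP-XGBoost applies exactly this bound, and how it relates to dp_epsilon_per_tree, is an assumption to check against the technical report.

```python
import math


def amplified_epsilon(epsilon, q):
    """Privacy loss of an epsilon-DP mechanism applied to a random
    subsample that contains each record with probability q."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    return math.log1p(q * math.expm1(epsilon))


# With subsample = 0.1, a mechanism that is 1.0-DP on the subsample
# has a noticeably smaller privacy loss with respect to the full dataset.
eps_full = amplified_epsilon(1.0, q=1.0)
eps_sub = amplified_epsilon(1.0, q=0.1)
```

For small 𝜀 the amplified loss is close to q·𝜀, so sampling 10% of the rows reduces the privacy loss by roughly a factor of ten.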

DP-XGBoost returns the trained model as a JSON object. This object contains the trained model, as in XGBoost, plus the parameters of all the basic DP mechanisms (Exponential and Laplace) used during training. The mechanisms are composed using basic composition: the privacy losses 𝜀 are summed, so the overall privacy consumption grows linearly with the number of mechanisms. If you prefer, you can compute the privacy loss with your own privacy accountant (RDP, Gaussian DP or f-DP) from the mechanisms reported in the JSON output. If you can tolerate a small 𝛿, this gives a much slower growth of the privacy loss: as the square root of the number of mechanisms instead of linearly.
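
The difference between linear and square-root growth can be illustrated with a short sketch. It compares basic composition with the classical advanced composition theorem, which is a simpler stand-in for the accountants mentioned above rather than what DP-XGBoost computes. The function names and the numbers (1000 mechanisms with 𝜀 = 0.01 each, 𝛿 = 1e-6) are hypothetical.

```python
import math


def basic_composition(epsilons):
    """Total privacy loss under basic composition: the sum of the epsilons."""
    return sum(epsilons)


def advanced_composition(epsilon, k, delta):
    """Total epsilon of k epsilon-DP mechanisms under the advanced
    composition theorem, at the cost of an additional failure probability delta."""
    return (math.sqrt(2.0 * k * math.log(1.0 / delta)) * epsilon
            + k * epsilon * math.expm1(epsilon))


k, eps_each, delta = 1000, 0.01, 1e-6
eps_basic = basic_composition([eps_each] * k)              # grows like k
eps_advanced = advanced_composition(eps_each, k, delta)    # grows roughly like sqrt(k)
```

With these numbers the basic bound is about 10 while the advanced bound is below 2. The advantage only appears when each mechanism has a small 𝜀; for a handful of mechanisms with large 𝜀, basic composition can give the tighter bound.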

If you plan to publish the trained model, be aware that the full JSON output is not, in itself, DP! This means you should publish only the split conditions and leaf values of each tree.
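
One way to follow this advice is to filter each tree down to a whitelist of fields before publishing. The sketch below assumes the per-tree JSON layout of the standard XGBoost dump (nested nodes with split, split_condition, leaf and children keys, plus statistics such as gain and cover); the DP-XGBoost output may be organised differently. The whitelist and the example tree are assumptions for illustration and should be checked against the technical report.

```python
import json

# Assumption: only the tree structure, split conditions and leaf values are kept.
PUBLISHABLE_KEYS = {"nodeid", "split", "split_condition", "yes", "no", "leaf"}


def strip_tree(node):
    """Return a copy of a tree node (and its subtree) restricted to the whitelist."""
    kept = {key: value for key, value in node.items() if key in PUBLISHABLE_KEYS}
    if "children" in node:
        kept["children"] = [strip_tree(child) for child in node["children"]]
    return kept


# Hypothetical tree in the assumed layout, including non-publishable statistics.
example_tree = json.loads("""
{
  "nodeid": 0, "split": "f0", "split_condition": 0.5, "yes": 1, "no": 2,
  "gain": 12.3, "cover": 200.0,
  "children": [
    {"nodeid": 1, "leaf": -0.1, "cover": 120.0},
    {"nodeid": 2, "leaf": 0.2, "cover": 80.0}
  ]
}
""")

public_tree = strip_tree(example_tree)
```

Using a whitelist rather than a blacklist means that any field not explicitly known to be safe is dropped by default.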

We’re thrilled to contribute to open-source DP initiatives and build on the awesome work of the XGBoost team! 😎

See the code and our technical report.

About the author

Joan Gonzalvez

Research Scientist

©2022 Sarus Technologies.
All rights reserved.