XGBoost is one of the most popular gradient boosted trees libraries and is featured in many winning solutions in Kaggle competitions. It’s written in C++ and usable from many languages: Python, R, Java, Julia, or Scala. It can run on major distributed environments (Kubernetes, Apache Spark, or Dask) to handle datasets with billions of examples.
XGBoost is often used to train models on sensitive data. Since it comes with no privacy guarantee, personal information may remain in the model weights. Differential Privacy (DP) is the right framework to address this privacy risk. We are happy to announce the first open-source differentially-private fork of XGBoost. The project can be found on GitHub.
How it works: DP boosted trees
In a 2019 paper, Li et al. developed a method for DP boosted trees. Sarus DP-XGBoost builds upon their ideas with some improvements.
The key improvements are the following:
- We use XGBoost’s approximate learning method, which can handle huge datasets, instead of the basic exact method, which only scales to small data.
- We reduce the noise added by leveraging the min_child_weight parameter in XGBoost. This parameter is normally used to reduce overfitting, but it also allows for a dramatic reduction in privacy loss.
- We leverage the subsample parameter to benefit from the privacy amplification by subsampling of Balle et al. (2018).
- We take advantage of XGBoost’s design to run in distributed mode on a variety of platforms such as Kubernetes, Dask, or Apache Spark.
The method is described in detail in a technical report.
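To get an intuition for the subsampling amplification mentioned above (this is an illustration of the general result of Balle et al., not the exact accounting used in DP-XGBoost): running an 𝜀-DP mechanism on a Poisson subsample that includes each row with probability q yields a log(1 + q(e^𝜀 − 1))-DP mechanism.

```python
import math

def amplified_epsilon(eps: float, q: float) -> float:
    """Privacy loss of an eps-DP mechanism run on a Poisson
    subsample that includes each row with probability q."""
    return math.log(1.0 + q * (math.exp(eps) - 1.0))

# Subsampling a tenth of the rows shrinks the per-tree budget
# roughly by a factor of ten when eps is small.
print(amplified_epsilon(0.1, 0.1))  # ≈ 0.0105
print(amplified_epsilon(0.1, 1.0))  # no subsampling: exactly 0.1
```

For small 𝜀 the amplified loss is approximately q·𝜀, which is why combining subsampling with many small-budget trees is attractive.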
Installing Sarus DP-XGBoost
To use the Python library, you can simply install the package with pip.
If package wheels are not available for your platform you’ll need to have cmake installed.
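Assuming the package is published under the name used in the Sarus repository (check the repo’s README for the exact name), installation looks like:

```shell
# Install the DP fork of XGBoost from PyPI
# (package name is an assumption; see the GitHub repo)
pip install dp-xgboost
```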
That’s it, now you can use XGBoost with our differentially-private tree learning algorithm in Python!
We have included regression and classification Python examples in the repo. For now, Sarus DP-XGBoost is available for Python and Spark (Scala). For differential privacy, the upper and lower bounds of each feature of the input matrix must be known (these bounds should either be public or themselves estimated with differential privacy). The labels should also lie in [-1, 1].
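The bounds requirement can be met with a simple preprocessing step. Here is a minimal sketch in plain Python, using publicly known bounds (never bounds computed from the private data itself, which would leak information):

```python
def clip(value, lo, hi):
    """Clamp a feature value into its public [lo, hi] range."""
    return max(lo, min(hi, value))

def scale_label(y, y_min, y_max):
    """Map a label from its public range [y_min, y_max] to [-1, 1]."""
    return 2.0 * (y - y_min) / (y_max - y_min) - 1.0

# Public, data-independent bounds, e.g. from the dataset's documentation.
FEATURE_BOUNDS = [(0.0, 100.0), (-5.0, 5.0)]
LABEL_BOUNDS = (0.0, 500.0)

rows = [[42.0, 7.5], [120.0, -1.0]]
labels = [250.0, 600.0]

X = [[clip(v, lo, hi) for v, (lo, hi) in zip(row, FEATURE_BOUNDS)]
     for row in rows]
y = [scale_label(clip(v, *LABEL_BOUNDS), *LABEL_BOUNDS) for v in labels]
print(X, y)  # [[42.0, 5.0], [100.0, -1.0]] [0.0, 1.0]
```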
Apart from this preprocessing, usage is very similar to standard XGBoost. For now, DP-XGBoost supports regression and binary classification, and we plan to add more objective functions in the future. Here’s a code snippet that shows DP-XGBoost in action.
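A sketch of a training run follows. It assumes the fork’s Python API mirrors standard XGBoost; the module name, the `tree_method` value, and the way feature bounds are declared are assumptions, so check the repo’s regression and classification examples for the exact API. Only `dp_epsilon_per_tree`, `min_child_weight`, `subsample`, and `num_boost_round` are taken from this post.

```python
# Sketch only: assumes the fork installs as `dp_xgboost` and mirrors the
# standard XGBoost Python API. `tree_method="approxDP"` is an assumption.
import dp_xgboost as xgb

# Features already clipped to public bounds, labels already in [-1, 1].
X = [[42.0, 5.0], [100.0, -1.0], [13.0, 2.5]]
y = [0.0, 1.0, -0.5]

dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "tree_method": "approxDP",    # DP variant of the approximate method
    "dp_epsilon_per_tree": 0.1,   # per-tree privacy budget
    "min_child_weight": 100,      # fewer, larger leaves -> less noise
    "subsample": 0.1,             # privacy amplification by subsampling
    "max_depth": 4,
}
booster = xgb.train(params, dtrain, num_boost_round=20)
# Total privacy loss under basic composition: 20 * 0.1 = 2.0
```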
DP-XGBoost parameters and result
The most important parameters for differential privacy are the privacy budget per tree 𝜀 (dp_epsilon_per_tree) and the number of trees (num_boost_round). The subsample and min_child_weight parameters can be used to improve out-of-sample accuracy and boost privacy.
DP-XGBoost returns the trained model as a JSON object. As with standard XGBoost, this object contains the trained model; in addition, it records the parameters of every basic DP mechanism (Exponential and Laplace mechanisms) used during training. The mechanisms are composed using basic composition: the privacy losses 𝜀 are summed, so the overall privacy consumption grows linearly with the number of mechanisms. If you can tolerate a small 𝛿, you can instead compute the privacy loss with your own privacy accountant (RDP, Gaussian DP, or f-DP) on the basis of the mechanisms reported in the JSON output, and achieve a much slower growth of the privacy loss: as the square root of the number of mechanisms instead of linear.
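As a quick sanity check of what a given configuration costs: under basic composition the per-mechanism budgets simply add up, while the standard advanced-composition bound (shown here only to illustrate the square-root growth; real accountants such as RDP give tighter numbers) pays a small 𝛿 for a sqrt(k) dependence.

```python
import math

def basic_composition(eps_per_mech, k):
    """k eps-DP mechanisms compose to (k * eps)-DP."""
    return k * eps_per_mech

def advanced_composition(eps, k, delta):
    """Standard advanced-composition bound: total loss grows like
    sqrt(k) instead of k, at the price of a small failure probability delta."""
    return (math.sqrt(2 * k * math.log(1 / delta)) * eps
            + k * eps * (math.exp(eps) - 1))

k, eps = 100, 0.1
print(basic_composition(eps, k))           # ≈ 10
print(advanced_composition(eps, k, 1e-6))  # ≈ 6.3, already smaller
```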
If you plan to publish the trained model, be aware that the full JSON output is not DP in itself! This means you should publish only the split conditions and leaf values of each tree.
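That filtering step could be sketched as follows, assuming the per-tree dumps follow the standard XGBoost JSON dump format (`Booster.get_dump(dump_format="json")`, with `split`, `split_condition`, `children`, and `leaf` fields); these field names are those of stock XGBoost and may differ in the fork’s output.

```python
import json

# Fields safe to publish: split conditions and leaf values only.
PUBLIC_KEYS = {"split", "split_condition", "leaf"}

def publishable(node):
    """Recursively keep only the publishable fields of a dumped tree node."""
    out = {k: v for k, v in node.items() if k in PUBLIC_KEYS}
    if "children" in node:
        out["children"] = [publishable(c) for c in node["children"]]
    return out

# A tiny example tree in the standard XGBoost dump format.
tree = json.loads("""
{"nodeid": 0, "depth": 0, "split": "f0", "split_condition": 0.5,
 "yes": 1, "no": 2, "missing": 1, "children": [
   {"nodeid": 1, "leaf": -0.4},
   {"nodeid": 2, "leaf": 0.7}]}
""")
print(publishable(tree))
```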
We’re thrilled to contribute to open-source DP initiatives and build on the awesome work of the XGBoost team! 😎