DP-SGD introduces new hyperparameters to tune in order to keep good privacy guarantees while insuring that the model learns the right features. Those hyperparameters are the gradient clipping value and the noise multiplier.
In the original implementation of DP-SGD by [Abadi 2016], those parameters were fixed throughout the training. However, there has been recent efforts to make those parameters evolve at each step to fit each iteration’s needs. The benefit is twofold.
First, using exactly the privacy budget needed can help improving the accuracy/privacy trade-off and thus improving the utility of DP-SGD.
Second, the choice of hyperparameters can have a great impact on the performances of the trained models, but those parameters can be tricky to set [Avent 2019]. Worse, contrary to non-private settings, here one cannot simply try multiple settings and take the best one. Some methods exist [Talwar 2019, Chaudhuri 2011], but they do not guarantee the optimality of the selected parameters and come at an increased privacy cost. While this is still an open subject, adaptive variations of DP-SGD might reduce the impact of the initial parameters’ choice and lead to better performances of the overall pipeline.
In particular, the clipping, noise and learning rate have attracted attention on this subject, either separately or conjointly.
In DP-SGD, the clipping parameter is used to bound the sensitivity of each gradient. A value too low could destroy most of the information and could completely change the direction of the gradient step, but a high clipping bound would overvalue the sensitivity of the queries and require more noise to be added.
Setting an adaptive gradient clipping was first proposed in [McMahan 2017] and further developed in [Thakkar 2019] in the context of federated learning. This approach enables the introduction of per-layer clipping, each layer being clipped at a different value, which is difficult to implement if all parameters have to be set by hand without looking at the data. In particular, it was shown that weight layers and bias layers need completely different clipping values in order to be optimal. The gradient norms also tend to evolve with time, from large values at the beginning of training to smaller norms as the model gets closer to the optimum [van der Veen 2018].
First, note that in a DP set-up, using the gradient norms cannot be done without incurring a privacy cost. In all the methods presented here, part of the privacy budget of each step is used to adapt these norms.
In [van der Veen 2018], the clipping bound for step t is simply proportional to the (DP estimate of the) gradient norm at t-1. The scaling factor is proposed to be set to a value slightly larger than 1 in order to give some margin of error to the algorithm.
[Thakkar 2019] proposes to fix C as a quantile of the observed norms and use gradient descent to make C converge to this value:
Since this loss is minimum when Pr(X<C) = γ, convex and bounded by 1, online gradient descent can be applied to C. On a sample of size n, with β the proportion of elements lower than C, the gradient of the loss is simply β-γ. The clipping bound C can thus be updated to C-η(β-γ) for a given learning rate η. Here, γ is a hyperparameter, which should be set to a value close to 1 in order to capture most of the information.
[Pichapati 2019] introduces a different approach. Instead of looking for a good clipping value, the gradients are centered and standardized, clipped to 1, added noise with sensitivity 1 and then transformed back to their original mean and variance. The problem is thus transferred to estimating the mean and standard deviation, which can be inferred from the values of the gradient at previous timesteps. Furthermore, this can be done for each parameter independently, pushing the adaptivity even further.
Finally, it should be noted that some versions of DP-SGD introducing adaptive privacy budget also integrate the clipping bound to their methodology as we will see later (see [Chen 2020] for instance).
To sum up, gradient clipping can be adapted:
- by evaluating the previous gradient’s norm
- using gradient descent toward a given quantile
- by evaluating the mean and standard deviation of the gradients
Adaptive learning rates is not necessarily a DP-specific problem. In particular, it has attracted attention in the context of large-batch training [Goyal 2017, You 2017, You 2019]. Another recent approach [Bernstein 2020] proposes a new optimizer that changes the learning rate as function of the model’s weights and their gradients.
All those approaches try to fulfill one goal: adapt the learning rate to the level of trust in the update. The more confident we are in the gradient’s direction, the larger the learning rate can be set. This is an issue that is inherent to DP-SGD: as we clip the gradients and add noise, their directions can be completely lost in the sanitation process.
[Koskela 2020] proposes to estimate this confidence specifically in the context of differential privacy. To do so, at each step, they compute the difference between the gradients that would be obtained by doing either one step at a given stepsize or two steps at half-stepsize. The learning rate is then updated accordingly. If the noise does not suppress all the signal contained in the gradients, the two values should be close and the learning rate is thus increased. On the other hand, if the signal-to-noise ratio is too low, the two values will be further apart and the learning rate will be decreased to capture it.
Taken on its own, the learning rate can thus be optimized either:
- by computing it as a function of the gradient norm
- by comparing two ways of estimating the gradient update
While [Koskela 2020] focused on using its confidence indicator to update the learning rate, other papers prefer to change the amount of noise added to the gradient, and ultimately the privacy budget used at each step. In practice, often both the learning rate and noise are adapted simultaneously.
The need for adaptive privacy budget comes from the same observation as for adaptive clipping: gradient norms are larger at the beginning than down the road, so the privacy needed to keep information should be changed to keep a similar signal-to-noise ratio.
The first paper to have proposed adaptive noise is [Lee 2018]. Their method consists, given a current budget ε and a list of possible step sizes containing 0, in using the NoisyMin algorithm from [Dwork 2014] to find the stepsize α that minimizes the loss. If α is 0, then the privacy budget is increased and a new stepsize is searched, until depletion of the total allocated budget.
[Chen 2020] continues on this idea, but using the Sparse Vector Technique from [Dwork 2014] to search for the optimal stepsize, which reduces the privacy spending to a constant instead of linear in the number of possible stepsizes. The algorithm is also given a maximal number of iterations. If it is attained, the privacy budget is increased to allow a more precise gradient to be computed. In addition, the clipping bound can be decreased at the same time, since both values are representative of the gradient norm.
[Yu 2019] has proposed simpler approaches that have the benefit of not needing to spend privacy to estimate if the budget should be increased. In case of an available public dataset, the validation accuracy is monitored at each step. If it stops improving, the budget is increased accordingly. If no public dataset can be used, one can try predefined schedules such as time-based, exponential or polynomial decay.
The per-step privacy budget can thus be set:
- by estimating the optimal step size and increasing the budget if it is found to be 0
- using a public dataset similar to the one studied
- with a predefined schedule
In this article, we have seen that some research efforts have been spent on making DP-SGD adaptive on several dimensions. Hopefully, this will help improve the accuracy/privacy trade-off that remains problematic for a number of real-world applications.
As noted in [Tramèr 2020] and as we can see in the table above, the privacy spent on adaptivity can be costly overall and do not compare too well to perfectly tuned vanilla implementations. However, vanilla algorithms often exhibit shiny results but fail to take into account the hyperparameter tuning cost, which severely reduce their performances as exposed in [Koskela 2020]…