Today, making sensitive data accessible for ML involves heavy processes, significant work, and real risk!
Traditionally, making sensitive data accessible to data scientists meant defining, applying, and maintaining a data masking strategy. This is a multi-month process that requires, among other things, the compliance team to assess the risk. The implementation then often hits an engineering resource bottleneck, and in the end the pseudonymized data may no longer be useful for the analysis, especially if records need to be joined across users (or, conversely, the pseudonymized data may not be as well protected as expected…).
Some privacy libraries offer differential privacy so that, instead of granting the data scientist access to the data, an engineer can prepare insights and share only those that are safe from a privacy perspective. These libraries are designed for engineers who have full access to the data; they require analysts to ask an engineer to produce insights for them. This may work if the analyst has a very precise idea of the calculation they are interested in, but it pushes even more burden onto overstretched data engineering teams. On top of that, privacy libraries usually come with new, complex syntaxes, and any misuse of their parameters can quickly destroy all privacy guarantees.
So on one hand you have data masking, which is compliance-heavy, requires engineering work, and carries significant residual risk. On the other hand you have privacy libraries, which eliminate the residual risk but at the cost of even more burden on engineers and a very rigid way of working for analysts.
Not really satisfying.
Enter the Sarus private learning SDK: an easy-to-use Python SDK reconciling data privacy with the data scientist's exploration work
At Sarus, we wanted to reconcile data privacy with the depth and variety of the data scientist's exploration, analysis, and modeling work, while relieving data engineers of the anonymization burden. This is why we built the Sarus private learning SDK: with this new version of our SDK, a data scientist can manipulate a remote dataset any way they want using their usual data science tools (pandas, numpy, scikit-learn…), all with full security thanks to on-the-fly, automated, and frictionless privacy compilation.
The Sarus PL SDK can be installed with the usual ‘pip install sarus’. Once the library is installed, the data scientist selects the datasets they want to analyze. They can then use their favorite Python libraries to work with the remote dataset, writing the exact same code they would write if the data were on the local filesystem.
The only thing the data scientist needs to change is the import line, because the original libraries assume the data is local, whereas in our case it remains remote. So we have wrapped each library so that everything else stays the same.
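To illustrate the idea of a wrapper that changes only the import line, here is a toy sketch (this is not the real Sarus code, and the `Wrapper` class is purely illustrative): a wrapper object forwards every name to the original library, so user code stays identical while the wrapper is free to intercept each call.

```python
# Toy sketch of the "change only the import line" idea.
# Not the real Sarus implementation; the Wrapper class is hypothetical.
import statistics


class Wrapper:
    """Forwards attribute access to a wrapped module, recording each call."""

    def __init__(self, module):
        self._module = module
        self.calls = []  # names of the functions the user invoked

    def __getattr__(self, name):
        fn = getattr(self._module, name)

        def traced(*args, **kwargs):
            self.calls.append(name)  # a real wrapper could defer the call
            return fn(*args, **kwargs)

        return traced


# The user writes `stats` instead of `statistics`; the rest is unchanged.
stats = Wrapper(statistics)
print(stats.mean([1, 2, 3]))  # 2, same result as plain statistics.mean
print(stats.calls)            # ['mean']
```

Because the wrapper exposes the same names as the original library, any downstream code that only uses those names keeps working without modification.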
That’s it! The rest of the code is exactly the same as if one were manipulating data without Sarus, except that the data stays safe!
The magic under the hood
The data, instead of being stored and manipulated locally, remains in secure storage behind the Sarus gateway. Each time the data scientist executes some code, the Sarus SDK captures it and registers it in a graph of operations. Execution of the graph is lazy until a line of code expects an output (e.g., an extract of rows, aggregates, model weights, model performance). At that point, the SDK sends the graph of operations to the Sarus gateway for execution.
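The lazy-graph mechanism described above can be sketched in a few lines. This is a minimal, self-contained toy (the `LazyFrame` class and its methods are illustrative, not the Sarus API): operations are only recorded until an output-producing call such as `head()` triggers execution, which in Sarus's case would happen server-side behind the gateway.

```python
# Minimal sketch of lazy capture of operations into a graph.
# LazyFrame and its methods are illustrative, not the real Sarus API.
class LazyFrame:
    """Records operations; nothing runs until an output is requested."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # the captured graph of operations

    def filter(self, pred):
        # Returns a new lazy node; no computation happens here.
        return LazyFrame(self._data, self._ops + [("filter", pred)])

    def select(self, fn):
        return LazyFrame(self._data, self._ops + [("map", fn)])

    def head(self, n=5):
        # An output is requested: only now is the graph executed
        # (server-side, in Sarus's case).
        rows = self._data
        for kind, f in self._ops:
            if kind == "map":
                rows = [f(r) for r in rows]
            else:
                rows = [r for r in rows if f(r)]
        return list(rows)[:n]


df = LazyFrame(range(10))
result = df.filter(lambda x: x % 2 == 0).select(lambda x: x * x)
print(len(result._ops))  # 2 recorded operations, no computation yet
print(result.head(3))    # [0, 4, 16]
```

The key design point is that every intermediate object is just a description of work, which is what lets a gateway inspect, compile, and privacy-check the whole computation before anything runs on real data.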
You may wonder what you get when you ask for data rows, with .head() for example. You actually get synthetic data. Because Sarus generates synthetic data with differential privacy (DP), the compiler always has the option of running the graph of operations on synthetic data to produce a DP output. This way, the data scientist gets a sense of the source data, as well as of any dataframe derived from it. Never accessing the actual data does not get in the way of knowing what the columns are, their types, ranges, and distributions, how to join them, or how the code performs. This is precious for making the data scientist's experience seamless.
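To make this concrete, here is a toy sketch of the idea that the same captured pipeline can be executed on either the real rows or a synthetic stand-in, with preview-style calls routed to the synthetic side. The data and the `run` helper are invented for illustration; a real DP synthetic generator is far more involved than a hand-written list.

```python
# Toy sketch: one pipeline, two possible execution targets.
# All names and data are illustrative, not the real Sarus internals.
real = [23, 35, 41, 52, 60]       # e.g. ages; never shown to the analyst
synthetic = [25, 33, 44, 50, 58]  # stand-in for DP synthetic rows

# The captured graph of operations, as in the lazy-execution step above.
pipeline = [("filter", lambda a: a >= 40), ("map", lambda a: a / 10)]


def run(rows, ops):
    """Executes a recorded pipeline on whichever rows it is given."""
    for kind, f in ops:
        if kind == "map":
            rows = [f(r) for r in rows]
        else:
            rows = [r for r in rows if f(r)]
    return rows


# A .head()-style preview is served from the synthetic rows,
# so the analyst sees plausible values without touching `real`.
preview = run(synthetic, pipeline)[:3]
print(preview)  # [4.4, 5.0, 5.8]
```

Because the pipeline is just data, choosing the execution target is a compiler decision, invisible to the analyst's code.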
The Sarus private learning SDK already supports most numpy, pandas, sklearn, xgboost, and tensorflow primitives, which can be used and combined freely. And the best part is that every library that relies on them to interact with data will also work natively.
We are super excited about the ocean of new possibilities the Sarus private learning SDK opens up to all data scientists. We can’t wait to have you try it! If you want to do so, just let us know!