Because it’s always nicer to try a product than just read about it, we’ve just released a playable demo of Sarus! This post walks you through the whole workflow.
Demo setup
Let’s imagine a data scientist who wants to study a dataset made of census data. This dataset is very sensitive, and opening direct access to it may take months of compliance and anonymization work. Instead, a data admin can set up Sarus so that the data scientist can work on the data without ever accessing it.
In this post, we let you put yourself in the shoes of either the data scientist or the data admin.
We first show you how, as a data scientist, you can easily train a machine learning model or do BI analysis from your usual tools with Sarus, without direct access to the data.
Then, we take a look under the hood and present the few simple steps the data admin follows to prepare a dataset and let data scientists work on it.
Feel free to test any part, and build your own data use cases from this demo!
As a Data Scientist: Build a TensorFlow model on data you can’t access
In this part, we assume that a sensitive dataset is ready for analysis and that query access has been granted to you (we’ll use demo credentials).
Video where Maxime, Sarus cofounder, goes through all the steps to build a machine learning model with Sarus
Here is a notebook that illustrates how to train a TF model on remote data you do not have access to.
Just open it from your usual Python environment.
NB: Python 3.8 is required (unfortunately, it is not yet supported by Google Colab without an advanced setup). For users running Jupyter on Apple M1, support is experimental, so reach out!
In this notebook you will see how to manipulate the remote data exactly as you would a pandas DataFrame, and how to build and train a TensorFlow model on it. Each method that returns record-level information falls back to synthetic data (e.g., dataframe.head()), while all aggregate queries are executed against the original data with differential privacy (e.g., dataframe.mean(), model.fit()).
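To give an intuition of what “executed with differential privacy” means, here is a minimal, self-contained sketch of a differentially private mean using the Laplace mechanism. This illustrates the general technique only, not Sarus’ actual implementation; the clipping bounds, epsilon value, and sample data below are made up for the example.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng=None):
    """Differentially private mean via the Laplace mechanism.

    Values are clipped to [lower, upper] so that one individual's
    contribution is bounded, then Laplace noise scaled to the
    sensitivity of the mean is added to the true mean.
    """
    rng = rng or np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    n = len(clipped)
    sensitivity = (upper - lower) / n  # max change caused by one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Example: a private mean of ages, clipped to [0, 100], with epsilon = 1.0
ages = np.array([23, 45, 31, 62, 54, 38, 29, 41, 50, 36])
print(dp_mean(ages, lower=0, upper=100, epsilon=1.0))
```

A smaller epsilon means more noise and stronger privacy; the data scientist only ever sees the noisy aggregate, never the individual records.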
Don’t hesitate to adapt the code and play around!
You can also check out our second example on how to do sentiment analysis on IMDB reviews!
As a Data Scientist or Analyst: Do BI analysis with Metabase
Now let’s see how you can use Sarus from a BI tool like Metabase, Power BI or Tableau. We will use Metabase as it is open source (to install it, follow these instructions).
We assume that the sensitive census dataset is ready for analysis and that query access has been granted to you (we’ll use demo credentials).
Video on how to do BI in Metabase with Sarus
All we need to do is set up a connection between the Metabase instance and Sarus. In the Metabase admin settings, create a new database with the following parameters:
- Database type: Spark SQL
- Name: Private Census
- Host: demo.sarus.tech
- Port: 10000
- Database name: private_census
- Username: firstname.lastname@example.org
- Password: Demo1
- Additional JDBC connection string options (required): ;auth=noSasl
The remote table behaves exactly like a table you have full access to: each aggregate query is automatically rewritten with differential privacy, and all record-level queries are computed on synthetic data.
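Conceptually, the gateway behaves like a dispatcher: aggregate queries go to the real data with noise added, while record-level queries are served from the synthetic copy. The toy sketch below illustrates that routing idea only; the function names and the naive aggregate detection are made up for illustration and are not Sarus’ actual query rewriter.

```python
# Toy illustration of query routing: aggregates hit the real data
# (with differential privacy), record-level queries hit synthetic data.
AGGREGATES = ("COUNT", "SUM", "AVG", "MIN", "MAX")

def is_aggregate(sql: str) -> bool:
    """Crude check: does the query call an aggregate function?"""
    upper_sql = sql.upper()
    return any(fn + "(" in upper_sql for fn in AGGREGATES)

def route(sql: str) -> str:
    """Decide which backing data a query should run against."""
    if is_aggregate(sql):
        return "real data + differential privacy"
    return "synthetic data"

print(route("SELECT AVG(age) FROM private_census"))   # real data + differential privacy
print(route("SELECT * FROM private_census LIMIT 5"))  # synthetic data
```

From the analyst’s point of view this routing is invisible: both kinds of queries return results, and neither exposes any individual record.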
As a Data Admin: Prepare datasets and grant query access rights with privacy policies
To let data scientists work on sensitive data they cannot see, a data admin has to prepare the dataset and grant privacy-safe access, i.e., define who can query it and how. Let’s see how it works.
Video on how to prepare data and grant privacy-safe accesses to data scientists
We will first create a data scientist user, then prepare a dataset and finally grant the appropriate rights for the data scientist to work on the dataset. All that in just a few minutes!
- Sign up to the Sarus admin dashboard using Google SSO.
- Go to the “Users” section to create a user account for the data scientist you want to grant query access to.
- Fill in the data scientist’s email address. It must be different from the one you used to log in.
- Apply the “Data Practitioner” role and validate.
- Share the invitation link with your data scientist friend. If you’re playing both characters, you can open it in a private browser session or another browser to finalize the data scientist account creation.
- Go to the “Datasets” section and click the “+Add” button.
- Choose a name for your dataset.
- Select a source using a data connection. Sarus has connectors to many sources and storage solutions (GCS, BQ, S3, AzureSQL, PostgreSQL, etc.). Here we’ll use the “default Redshift connection” that is already set up in the demo version.
- Choose a data table from the available data sources (e.g., private.census).
- Click “Next” to launch the dataset preparation. The app automatically reads the data and detects the schema.
- Finalize the preparation by clicking the “Add dataset” button. This triggers the generation of the synthetic data, which may take up to 10 minutes.
- On the “Access Rules” tab, select the data scientist account you have created, assign the predefined privacy policies “Internal Analysts/Data Scientists policy” to them, and click “Add”.
When the synthetic data generation is finished, the dataset status turns to “Ready” and the data scientist can start running analyses with their credentials! (See the data scientist sections above: they can start from the notebooks, adapt them to use their new credentials on the new dataset, and run any analysis they like without any risk of exposing personal data.)
In this demo, you have seen how a data scientist can train machine learning models, and how an analyst can do BI analysis, on sensitive data without directly seeing it. You’ve also seen how easily a data admin can set all of that up.
At no point was any personal information exposed to the analysts: all interactions were protected with differential privacy.
Of course, this is just a subset of Sarus’ potential. Our application deploys directly in the customer’s cloud in just a few clicks, features connectors to all standard data sources, and lets you work on any type of data with all your favorite analytics tools and machine learning libraries.
We hope you liked it! Don’t hesitate to play with any part of our product (admin, ML, and/or BI), comment, share, and get in touch. We’d be super happy to hear your thoughts!