At Sarus we build tools to work with privacy sensitive data. Our product is built around 3 ideas:
- Remote execution of code, to keep sensitive data undisclosed.
- Synthetic data generation, to give the user something to build with.
- Differential privacy, to control privacy loss.
Hence, we spend a lot of time training machine learning models for synthetic data generation. Besides, we design tools for remote execution. Dask is somehow related to both. It is a great library for parallel computing in Python; but it is also a well designed piece of software and a great source of inspiration for seamless remote execution of code.
Step 1: Start a kubernetes cluster
On the GCP console, create a GKE cluster:
In our case, it will be a basic cluster named sarus-clusterin the europe-west1 region and in our current project: sarus-public.
Then you need to download a credentials file to access the cluster from your computer. To do so, simply type the following command in a terminal:
If you don’t have gcloud installed, just have a look at this doc, it should install easily on many platforms.
Your kubernetes cluster should now be running. To control the cluster you need to install the kubectl command line utility. This doc gives an overview of the tool and how to install it.
You can check the nodes of your cluster by running the following command:
Step 2: Deploy dask using helm
If you are not happy with the default configuration of the dask helm chart. You can use a configuration file:
You can check the deployment went well and that the various services provided by dask are running properly using:
You can check the state of the cluster through a web UI, but, because the dask cluster does not expose any address to access it from outside GCP, you need to forward local HTTP requests to the cluster using the following command:
You can then check http://localhost:8001/status and 🎉 you should see something like this:
Not so interesting for now 🤔.
Step 3: Connect to the cluster and voilà!
This connects you to the cluster. You are then ready to launch a simple job:
You can watch the workers getting busy in the web UI:
You can also manipulate panda-like dataframes:
We can see the cluster store the data:
Dask can also be used to launch Machine Learning jobs in parallel. Here is an example of a grid search on the hyper-parameters of a Support Vector Classification:
This demonstrate how powerful the cloud + k8s + dask can be. It also shows how mature the data-engineering ecosystem has become, since getting up and running is a matter of a few clicks and commands.
Of course there is a lot to do to make this ready for a production environment (from a security and permission management perspective at least), but kubernetes already does a lot, like automated resource provisioning and auto-scaling / self-healing.
In a later post we will show how to run real-life jobs on large datasets in GCS or S3 buckets. If you are interested in how to run remote jobs with strong privacy protection and super easy UX, contact us for a demo of Sarus.