The most common data engineering architectures are data lakes and data warehouses. Both are centralized systems: data from every data-producing entity is pooled into one location, with a single data engineering team in charge. This comes with a number of challenges and limitations:
- Data engineers have no particular domain knowledge. The systems producing the data may change fast, and having a separate team responsible for integrating them into a centralized system adds complexity and delay. Data lakes and warehouses are not very agile.
- It’s hard to add many data sources, as a centralized engineering team is difficult to scale.
- Moving the data into a central repository can be legally challenging, especially across lines of business or geographic borders.
- The risk of data leakage is compounded when data is copied into a central repository.
Why is Data Mesh getting so much interest now?
The Data Mesh architecture was introduced in May 2019 by Zhamak Dehghani (@zhamakd) (in this paper). It is a significant shift in data architecture design, overcoming the limitations of centralized approaches and providing scalability and agility.
The core principles of a Data Mesh are:
- Distributed, domain-oriented data: as opposed to a centralized repository where many entities (think of Marketing, Operations, R&D…) push their data, a Data Mesh organizes data by domain. Data engineers therefore have domain knowledge, and can be much more efficient and agile.
- Data as a product: the experience of data consumers must be the key concern of the domain owner. Anybody (authorized, of course) should be able to consume data autonomously, without assistance from the domain’s data team. Data should be of high quality, discoverable, self-describing and, in general, easy to consume.
- Self-serve data platform: to empower domain data teams, the organization must provide a self-serve data infrastructure with tools such as big data storage, federated identity management or data access logging.
- Decentralized & federated governance: global (organization-level) standards should be defined via federated mechanisms, with domain representatives. Those standards are computationally enforced in the platform.
There is an active community around Data Mesh that you can join here.
A change of culture & organization: when Data becomes the product
Each data-producing entity is now expected to make its data available via a well-defined, well-documented standard. Data engineers now belong to the domain, so it’s much easier to scale: adding domains no longer puts extra load on a centralized data team.
Domain data teams must now treat their data as their product, and the consumers of the data as their customers. Customer satisfaction must be measured and tracked to drive the data teams’ efforts.
A welcome convergence of software & data engineering architecture and philosophy
The introduction of Data Meshes shares some similarities with the transition from monoliths to microservices in software engineering. It stresses the importance of domain owners and interoperability, and is designed to scale in a highly-complex setup.
Introducing the Confidential Data Mesh
In most organisations, many datasets are highly sensitive and subject to restrictions (for regulatory or commercial reasons). For example, it may not be possible to move personal data outside of a legal entity or a country. Those datasets can’t be added to the mesh, and analysts outside of the original domain cannot leverage them.
That’s a significant missed opportunity. Traditional remediation strategies rely on data masking, but masking requires bespoke data wrangling, fails to solve the compliance challenge at scale and can significantly reduce data utility (more on data masking shortcomings here). To leverage the full power of Data Meshes, we need to solve confidential data access at scale.
Enter Privacy-Preserving Technologies and in particular Differential Privacy.
Differential Privacy is a mathematical framework that defines privacy risks in data analysis. It provides mechanisms and algorithms that preserve private information when working on data. It was introduced in 2006 by Cynthia Dwork and co-authors, and has slowly made its way into commercial applications (there is still a lot to be done). It is widely used by Google, Apple and Microsoft for their internal needs, seizing the opportunity to leverage data while protecting privacy. Here is a good resource to learn about DP without too much math. And for a more systematic review of it, this is the book of reference: The Algorithmic Foundations of Differential Privacy (Dwork, Roth).
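For readers who want the formal statement, the standard definition can be written as follows. The symbols are the usual textbook ones rather than anything specific to this post: M is a randomized computation, D and D' are two datasets that differ by one individual's data, S is any set of possible outputs, and ε (epsilon) is the privacy parameter.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \, \Pr[\,M(D') \in S\,]
```

The smaller ε is, the less the output distribution can change when one individual's data is added or removed, and so the less the result can reveal about that individual.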
Differential privacy can be considered as the definition of what it takes for the result of a computation to be anonymous. The result is anonymous in the sense that it does not reveal significant information about any given individual. It applies to any kind of computation, but machine learning and SQL analysis are obvious candidates. Applying differential privacy to one row of data would not make a lot of sense, since the result must not depend significantly on any single row! But luckily, no one is doing ML or BI on one row of data. In the Data Mesh architecture, each domain’s dataset sits in one place, so each analysis can be performed on all of its rows, which makes it a natural match for differential privacy.
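To make this concrete, here is a minimal sketch of the classic Laplace mechanism applied to a count. The function names (`laplace_noise`, `dp_count`) and the toy `patients` dataset are made up for illustration, and sampling noise with plain floating-point arithmetic is known to be unsafe for real deployments, so treat this as a teaching aid rather than something to ship.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(rows, predicate, epsilon, rng=None):
    """Release a noisy count of the rows matching `predicate`.

    Adding or removing one row changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy dataset: one row per individual.
patients = [{"age": 20 + (i % 60), "smoker": i % 4 == 0} for i in range(10_000)]

# Seeded only to make the example reproducible; do not seed in practice.
noisy = dp_count(patients, lambda r: r["smoker"], epsilon=0.5,
                 rng=random.Random(0))
print(f"noisy number of smokers: {noisy:.1f}")
```

On 10,000 rows the noise (a few units at most, with overwhelming probability) is negligible compared to the count itself, which is why differential privacy works well on aggregates and poorly on single rows.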
With Differential Privacy we can now design a truly scalable Confidential Data Mesh. This is the extension of the classic Data Mesh when some domains have sensitive or regulated data that cannot be shared across the organization. Each domain implements the privacy policies relevant to their regulatory constraints.
In the Confidential Data Mesh, no sensitive data can be extracted from a node, but computations can still be performed on it. Since it implements the same interoperability requirements as the classic Data Mesh, data sources can be catalogued just the same. Applications that need to pull samples of data or even stream the data can always access synthetic data that mimics the properties of the source and is generated with differentially-private deep learning algorithms. Node-level databases can be queried using differentially-private implementations of SQL queries. Machine learning models can be trained on the original data, still with differential-privacy guarantees.
In the end, from the data practitioners’ perspective, the Confidential Data Mesh behaves exactly like the original Data Mesh. The main difference is that the ownership principles also extend to the implementation of the compliance standards and privacy policies.
Build your own Confidential Data Mesh (CDM)
The building blocks of a CDM are:
- A remote execution framework that implements differential privacy in SQL queries or ML jobs
- An accountant that keeps track of all privacy-impacting queries on the dataset
- A differentially-private synthetic data generator
- An API that implements the interoperability standards (including catalogs, data exports, SQL driver…)
- And the glue that sticks all of it together.
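As an illustration of the accountant building block, here is a minimal sketch that tracks epsilon spending per user and refuses queries that would exceed a cap. Everything in it (`PrivacyAccountant`, `BudgetExceeded`, the per-user `budgets` policy) is a hypothetical design, not the API of any existing library. It simply sums epsilons, which is a conservative way to account for sequential queries, and it assumes users do not pool their results; a single dataset-wide cap would be the stricter choice.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a query would push a user past their epsilon cap."""


@dataclass
class PrivacyAccountant:
    # Per-user epsilon caps set by the domain admin (hypothetical policy model).
    budgets: dict
    # Cap applied to users without an explicit entry: no access by default.
    default_budget: float = 0.0
    spent: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def remaining(self, user):
        cap = self.budgets.get(user, self.default_budget)
        return cap - self.spent.get(user, 0.0)

    def charge(self, user, epsilon, description):
        """Record a privacy-impacting query, or refuse it if over budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining(user) + 1e-12:
            raise BudgetExceeded(f"{user} cannot spend epsilon={epsilon}")
        # Plain summation: a conservative bound for sequential queries.
        self.spent[user] = self.spent.get(user, 0.0) + epsilon
        self.log.append((user, epsilon, description))


accountant = PrivacyAccountant(budgets={"alice": 1.0, "bob": 0.2})
accountant.charge("alice", 0.4, "count of smokers")
accountant.charge("alice", 0.4, "mean age")

try:
    accountant.charge("alice", 0.4, "third query")
    rejected = False
except BudgetExceeded:
    rejected = True  # only about 0.2 of alice's budget was left
```

The execution framework would call `charge` before running any differentially-private query, so that the log doubles as an audit trail of everything released from the node.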
There is no open-source solution to deploy a CDM in one click but fortunately some of the fundamental building blocks are available. Most of the open source projects are still at the experimental stage but they are a good resource to start with. The most interesting ones are:
- OpenDP: a joint effort by Harvard and Microsoft that provides the main toolbox for differentially-private computation. The main contributions are a set of DP primitives (smartnoise-core) and an SDK to run SQL queries (smartnoise-sdk).
- Google Differential Privacy: similar to OpenDP, Google DP provides DP primitives that can run on top of Apache Beam, a differentially-private SQL engine, and a privacy accountant.
- TensorFlow/privacy: it allows training TensorFlow models with differential privacy. Note that this code has been packaged in OpenMined.
- PyTorch Opacus: a library from Facebook to train PyTorch models with differential privacy.
With those libraries installed on confidential domains, a user can perform queries on the confidential data with differential privacy guarantees.
This is a good start but we are still far from a working solution, as many critical parts are missing:
- The API: Data Meshes are powerful because all communication with external users is funneled through a standardized API. The same goes for CDM to provide interoperability with other nodes.
- Enforcing privacy compliance: Open source libraries are designed for a scientist who can access the original data and wants to release a private output. It is the user’s responsibility to choose safe differential privacy parameters. In a CDM, this must be enforced by the platform so that the user cannot execute any privacy-exposing query.
- Optimizing Privacy consumption: DP comes with the notion of a privacy budget. But this budget will run out very fast if each query is considered in isolation. For optimal accuracy one wants to leverage all public information to build useful priors as well as anything that has been released previously. Implementing memoization is an essential part of executing consecutive queries or learning jobs.
- Synthetic data: a data practitioner will want to see some samples of the original data. Since revealing the original data is not an option, synthetic data is a must-have. It must be generated with differential privacy to preserve the privacy guarantees.
- Privacy governance with policies and an accountant: how do you control the privacy loss accumulated over several requests? (Hint: it’s not trivial, as epsilon, a key metric in differential privacy, does not simply add up: summing is a valid worst-case bound, but tighter composition results mean that spending epsilon=1 and then epsilon=2 does not necessarily cost you epsilon=3 :( ). How can an admin set different policies for different users?
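Two of the points above lend themselves to a short sketch: memoizing released answers, and the gap between naive and tighter budget accounting. The names below (`MemoizedReleases`, `noisy_count`, `delta_slack`) are invented for this example. The tighter bound is the advanced composition theorem as presented in Dwork and Roth's book, which applies to k queries run at the same epsilon and trades a small failure probability (here `delta_slack`) for a smaller total.

```python
import math
import random


def basic_composition(epsilons):
    """Worst-case total for sequential pure-DP releases: the plain sum."""
    return sum(epsilons)


def advanced_composition(epsilon, k, delta_slack):
    """Total epsilon for k releases that are each epsilon-DP.

    The overall guarantee becomes (total, delta_slack)-DP rather than pure DP.
    """
    return (epsilon * math.sqrt(2 * k * math.log(1 / delta_slack))
            + k * epsilon * (math.exp(epsilon) - 1))


class MemoizedReleases:
    """Cache released answers so that repeating a query costs no extra budget."""

    def __init__(self):
        self._cache = {}
        self.spent = []  # epsilons actually consumed

    def release(self, query_key, epsilon, mechanism):
        if query_key in self._cache:
            # Re-serving an already released value reveals nothing new.
            return self._cache[query_key]
        answer = mechanism(epsilon)
        self._cache[query_key] = answer
        self.spent.append(epsilon)
        return answer


_rng = random.Random(0)  # seeded only to make the example reproducible


def noisy_count(epsilon, true_count=2500):
    """Stand-in DP query: a count with Laplace(1/epsilon) noise."""
    return true_count + _rng.expovariate(epsilon) - _rng.expovariate(epsilon)


store = MemoizedReleases()
query = "SELECT COUNT(*) FROM patients WHERE smoker"
first = store.release(query, 0.1, noisy_count)
again = store.release(query, 0.1, noisy_count)  # served from the cache

naive = basic_composition([0.1] * 100)           # about 10
tighter = advanced_composition(0.1, 100, 1e-6)   # noticeably smaller
```

The tighter bound only helps when there are many small-epsilon queries; for a handful of large ones the plain sum can be the smaller number, so a real accountant would typically keep the minimum of the bounds it knows how to compute.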
While no single open source library can bring a CDM to life, some startups and large organizations are slowly pulling all the pieces together to make the CDM the most efficient way to leverage sensitive data at scale.
At Sarus, we make data-centric organizations more efficient and agile when working with sensitive data in a fully compliant way. If you want to know more about the Confidential Data Mesh and how Sarus can help, or to get a demo and try it yourself, reach out at firstname.lastname@example.org.
And if you’re interested in solving privacy problems with state-of-the-art technology, we’re hiring so come join us!