[Just released] AI-based synthetic data generator!

A new deep-learning-based synthetic data generator for a smoother analysis experience when you can't see the data!

We are thrilled to introduce the latest version of our synthetic data generation model. The new model preserves the multivariate distribution across all columns of a table, so relationships between columns carry over to the synthetic data. This makes synthetic data an even more useful tool for analysts and data scientists who need insight into data they cannot directly access.

Synthetic data is useful for preparing analyses, designing machine learning pipelines, and debugging or testing code. It is a natural first step before running the analyses on the source data, which remains protected throughout:

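from sarus import Client

# Connect to the Sarus gateway (demo URL and example account)
client = Client(url="https://demo.sarus.tech/gateway", email="analyst@example.com")

# Fetch the dataset; the rows returned locally are synthetic, not the source rows
remote_dataset = client.dataset(slugname="census")
households = remote_dataset.as_pandas()
households.head(3)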

*Results evaluated from synthetic data only*

import seaborn as sns
import matplotlib.pyplot as plt

# Plot the income counts separately for each age group
for age, group in households.groupby('age'):
    # Order the bars from most to least frequent income value
    order = group['income'].value_counts().index
    g = sns.catplot(data=group, x='income', kind='count', orient='v', order=order)
    g.set_xticklabels(rotation=90)
    plt.title(age)
    plt.show()

*Income distribution for a given age group is preserved*

Comparison of real vs. synthetic data generated with the new Sarus generative model on different datasets and variables
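
One way to put a number on a comparison like this is the total variation distance between the income distribution within each age group in the two tables. The sketch below is an illustration under stated assumptions, not the evaluation Sarus ran: the records are made-up toy values, and `conditional_distribution` and `total_variation` are hypothetical helper names. Since it needs the source data, a check like this would have to run on the data owner's side.

```python
from collections import Counter, defaultdict


def conditional_distribution(rows, given, target):
    """Return {given_value: {target_value: probability}} estimated from rows."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[given]][row[target]] += 1
    return {
        g: {t: n / sum(c.values()) for t, n in c.items()}
        for g, c in counts.items()
    }


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0 means identical, 1 means no overlap at all.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


# Toy records, invented for illustration only.
real = [
    {"age": "18-30", "income": "<=50K"},
    {"age": "18-30", "income": "<=50K"},
    {"age": "18-30", "income": "<=50K"},
    {"age": "18-30", "income": ">50K"},
    {"age": "31-50", "income": "<=50K"},
    {"age": "31-50", "income": ">50K"},
    {"age": "31-50", "income": ">50K"},
    {"age": "31-50", "income": ">50K"},
]
synthetic = [
    {"age": "18-30", "income": "<=50K"},
    {"age": "18-30", "income": "<=50K"},
    {"age": "18-30", "income": ">50K"},
    {"age": "18-30", "income": ">50K"},
    {"age": "31-50", "income": "<=50K"},
    {"age": "31-50", "income": ">50K"},
    {"age": "31-50", "income": ">50K"},
    {"age": "31-50", "income": ">50K"},
]

real_dist = conditional_distribution(real, given="age", target="income")
synth_dist = conditional_distribution(synthetic, given="age", target="income")

# One distance per age group: small values mean the conditional
# income distribution is close in both tables.
tv_by_age = {
    age: total_variation(real_dist[age], synth_dist.get(age, {}))
    for age in real_dist
}
for age, tv in sorted(tv_by_age.items()):
    print(f"{age}: TV distance = {tv:.2f}")
```

A generator that only matched each column on its own could still score badly here, because the distance is computed per age group rather than on the overall income column.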

This new deep-learning model was designed by the Sarus research team. It is based on Transformers and implemented in JAX, a Python library for high-performance numerical computing. If you want to learn more, we published a research paper on the topic.

The model integrates differential privacy so that the generated synthetic data protects the personal information stored in the source data (more info on how to train a model in JAX with differential privacy).
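
This post does not say which mechanism the model uses. A common way to train a model with differential privacy is DP-SGD: clip each example's gradient to a fixed norm, sum the clipped gradients, and add Gaussian noise scaled to that norm before averaging. The sketch below shows only that aggregation step, written in plain Python rather than JAX so it runs without extra dependencies. `clip_to_norm`, `dp_average_gradient`, `clip_norm`, and `noise_multiplier` are illustrative names, the gradients are made-up values, and no privacy accounting is included.

```python
import math
import random


def clip_to_norm(grad, clip_norm):
    """Scale grad down so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_to_norm(g, clip_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # The noise standard deviation is tied to the clipping norm, which
    # bounds how much any single example can change the sum.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


# Hypothetical per-example gradients for a 2-parameter model.
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

noisy_mean = dp_average_gradient(
    grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng
)
print(noisy_mean)
```

In a JAX implementation the per-example gradients would typically come from combining `jax.grad` with `jax.vmap`, but the clip, sum, noise, and average steps stay the same.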

This new model helps analysts and data scientists work with sensitive data that they cannot directly access, opening up privacy-safe analysis use cases in healthcare, finance, energy, HR, and more. It is useful wherever companies or public authorities want to leverage data to innovate while the data must stay protected for security, compliance, or ethical reasons.

Want to see what the high-fidelity synthetic data looks like? Reach out!

About the author

Elodie Zanella
