Marketing segmentation without data access

You don’t need access to confidential customer data to build your AI-based marketing segments. Here’s how!

Data science
Josselin Pomat

Design a marketing segmentation that doesn't expose customer data

Marketing segmentation challenge: create value while protecting highly sensitive data

Market segmentation is a crucial marketing strategy that involves dividing a heterogeneous market into homogeneous groups based on certain shared characteristics, such as age, gender, or income. By utilizing highly personal features, marketers can create clusters to target each group with personalized promotions, coupons, and other relevant information. However, accessing these personal features poses significant privacy risks, making them often inaccessible to the marketer.

This demo shows how a marketing team can use Sarus to activate a market segmentation for a cosmetics brand, using data from a grocery retailer, without compromising the privacy of individuals.

Implementing market segmentation and activation with Sarus

What is Sarus?

Sarus is a solution which allows analysts to work on sensitive data without seeing it in the process. They can conduct statistical analysis and develop machine learning models to extract insights and make decisions without compromising the privacy of individuals.

We’ll see why this is a secure and efficient way to design a market segmentation strategy.

Retail dataset

Find the data on our GitHub here.

In this demo, we are using a subset of the retail dataset Completejourney — you can find a full description here. The original dataset contains data about household-level transactions over one year from a group of 2,469 households who are frequent shoppers at a grocery store — we will be using only data of 800 households for whom there are demographic insights.

The demo uses three tables:

['demographics_demo', 'transactions_sample', 'products_demo']

The demographics table provides detailed information about the socio-economic and demographic characteristics of households. The transactions and products tables describe the purchases made by households, including the amount spent, when and where they made the purchase, product category, and origin.

First rows of the demographics table

Data sensitivity

According to this scientific article, “99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes”. The retail dataset we will be using is therefore very sensitive. Moreover, it also contains insights about households’ consumption habits, which can make re-identification even easier.
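
To make the re-identification risk more concrete, here is a minimal sketch that measures how many records become unique once several demographic attributes are combined. The toy households and the `uniqueness_rate` helper are hypothetical and made up for illustration; they are not taken from the Completejourney data.

```python
from collections import Counter


def uniqueness_rate(records, attributes):
    """Share of records whose combination of `attributes` appears exactly once."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# Hypothetical toy households (values invented for this example).
households = [
    {"age": "25-34", "income": "50-74K", "marital_status": "Married", "household_size": 2},
    {"age": "25-34", "income": "50-74K", "marital_status": "Single", "household_size": 1},
    {"age": "25-34", "income": "35-49K", "marital_status": "Married", "household_size": 2},
    {"age": "45-54", "income": "50-74K", "marital_status": "Married", "household_size": 4},
    {"age": "25-34", "income": "50-74K", "marital_status": "Married", "household_size": 3},
    {"age": "45-54", "income": "35-49K", "marital_status": "Single", "household_size": 1},
]

# One attribute alone vs. four attributes combined.
rate_one = uniqueness_rate(households, ["age"])
rate_all = uniqueness_rate(
    households, ["age", "income", "marital_status", "household_size"]
)
print(rate_one, rate_all)
```

In this toy sample no household is unique on age alone, while every household is unique once the four attributes are combined. This is the mechanism that makes demographic tables easy to re-identify.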

Dataset preparation

To conduct this market segmentation, the data administrator from the retailer, who collected and owns the data, first onboards the dataset in Sarus. During this step, the application automatically generates Synthetic Data with a deep-learning generative model that preserves types and links between tables (learn more about the model in this article). A precise understanding of the data is very important for the analyst, and the synthetic data provides it without exposing real records.

Sample of the Synthetic Data of the demographics table

The data administrator from the retailer also customizes the Privacy Policy applied to the data scientist from the marketing team, specifying the results they will receive. In our case, they will have access to the Synthetic Data, Differentially Private results and the right to activate ids.
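
For readers unfamiliar with Differentially Private results, the sketch below shows the textbook Laplace mechanism applied to a simple count. It is only an illustration of the general idea: this article does not describe how Sarus computes its results, and the `laplace_noise` and `dp_count` helpers, the `epsilon` value and the toy data are all assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values, predicate, epsilon, rng):
    """Noisy count of the values matching `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical purchases, one department label per transaction.
departments = ["COSMETICS", "GROCERY", "COSMETICS", "PRODUCE", "COSMETICS"]

rng = random.Random(0)  # seeded only to make the example reproducible
noisy = dp_count(departments, lambda d: d == "COSMETICS", epsilon=1.0, rng=rng)
print(noisy)
```

A smaller `epsilon` means more noise and stronger privacy. The analyst only ever sees the noisy value, never the exact count.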

The data scientist now has access to the dataset called ‘retail_data’ and can conduct the market segmentation using Python.

Extract, explore, pre-process and segment the dataset

Notebook intro

You can find the notebook here!

First, the analyst connects to the instance where the dataset has been set up by the data administrator:

from sarus import Client
client = Client(
    password= ***)

The analyst will be using pandas, numpy and scikit-learn to manipulate the remote dataset. For this, the respective APIs from the Sarus libraries must be imported.

import sarus.pandas as pd
import sarus.numpy as np
from sarus.sklearn.cluster import KMeans

Explore the tables

Once the dataset has been selected, the analyst can see all its tables and get samples from each of them.

remote_dataset = client.dataset(slugname='retail_data')
remote_dataset.tables()

Note the output message: ‘Evaluated from synthetic data only’. Indeed, the analyst’s Privacy Policy does not allow them to see raw data, therefore the Sarus API returns the best alternative output according to the Policy: rows of synthetic data.

Extract the relevant data with a SQL query and process them with pandas

After doing some exploration on the different tables, the data scientist can create the view they want to work on using a SQL query.

query = """SELECT *
    FROM retail_data.private.demographics_demo d
    JOIN retail_data.private.transactions_sample t
    USING (household_id)
    JOIN retail_data.private.products_demo p
    USING (product_id)"""
df = remote_dataset.sql(query).as_pandas()
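
To see what this join produces without a Sarus instance, the same query shape can be tried locally on toy tables. The sketch below uses Python's built-in sqlite3 module; the rows and the `sales_value` column are hypothetical, and the `retail_data.private.` prefix is dropped because the tables live in a local in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demographics_demo (household_id INTEGER, age TEXT, income TEXT);
    CREATE TABLE transactions_sample (household_id INTEGER, product_id INTEGER, sales_value REAL);
    CREATE TABLE products_demo (product_id INTEGER, department TEXT);

    -- Made-up rows: household 3 has transactions but no demographics.
    INSERT INTO demographics_demo VALUES (1, '25-34', '50-74K'), (2, '45-54', '35-49K');
    INSERT INTO transactions_sample VALUES (1, 10, 3.5), (1, 11, 1.2), (2, 10, 4.0), (3, 11, 2.0);
    INSERT INTO products_demo VALUES (10, 'COSMETICS'), (11, 'GROCERY');
""")

query = """SELECT *
    FROM demographics_demo d
    JOIN transactions_sample t
    USING (household_id)
    JOIN products_demo p
    USING (product_id)"""

cursor = conn.execute(query)
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

Each output row is one transaction enriched with the household's demographics and the product's department. Because these are inner joins, transactions from households without demographic data are dropped, and the `USING` clause keeps a single copy of each join column.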

Now that the relevant data have been selected, the data scientist still has to pre-process them before training a machine learning algorithm.

# Check the missing values and drop them

# Check the type of columns

# Select the columns
df_dem = df[['home_ownership', 'age', 'income','marital_status',
     'household_size', 'household_comp', 'kids_count']]

# Process the categorical columns
cat = pd.get_dummies(df_dem.select_dtypes(["object"]), drop_first=True)

# Add the cosmetics consumption column

cosmetics_consumption = df.loc[df['department'] =='COSMETICS']\
    .groupby('household_id').agg({'income' : 'count'})\
    .rename(columns = {'income' : 'count_cosmetics_consumption'})

df_full = pd.merge(cat, cosmetics_consumption, how='left',

# Select a model and train it
model = KMeans(n_clusters=4)
fitted_model = model.fit(df_full)

# Add the group column to the dataframe 
labels = fitted_model.predict(df_full)
new_df = pd.concat([
            pd.DataFrame(labels, columns=['group'])],

## Create the audiences

list_ids_1 = new_df.loc[new_df['group'] == 0]['household_id']
list_ids_2 = new_df.loc[new_df['group'] == 1]['household_id']

Here we trained a KMeans model to create clusters and mapped each household to the corresponding group.
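
The pipeline above (encode the categorical columns, add the cosmetics count, cluster, then split households by group) can be reproduced end to end on toy data. The sketch below is a standard-library re-implementation for illustration only: it uses a plain Lloyd's algorithm rather than the scikit-learn KMeans called through Sarus, and the households, the `one_hot` and `kmeans` helpers and the choice of two clusters are all hypothetical.

```python
def one_hot(rows, column):
    """Encode a categorical column as 0/1 features, dropping the first level
    (the same idea as get_dummies(..., drop_first=True))."""
    levels = sorted({r[column] for r in rows})[1:]
    return [[1.0 if r[column] == lvl else 0.0 for lvl in levels] for r in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm, using the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical households (values invented for this example).
households = [
    {"household_id": 1, "marital_status": "Married", "count_cosmetics_consumption": 12},
    {"household_id": 2, "marital_status": "Single", "count_cosmetics_consumption": 0},
    {"household_id": 3, "marital_status": "Married", "count_cosmetics_consumption": 10},
    {"household_id": 4, "marital_status": "Single", "count_cosmetics_consumption": 1},
]

# Feature matrix: encoded marital status plus the cosmetics purchase count.
encoded = one_hot(households, "marital_status")
features = [
    enc + [float(h["count_cosmetics_consumption"])]
    for enc, h in zip(encoded, households)
]

labels = kmeans(features, k=2)

# Create the audiences: one list of household ids per cluster.
list_ids_1 = [h["household_id"] for h, lab in zip(households, labels) if lab == 0]
list_ids_2 = [h["household_id"] for h, lab in zip(households, labels) if lab == 1]
print(list_ids_1, list_ids_2)
```

On this toy data the heavy cosmetics buyers (households 1 and 3) end up in one audience and the light buyers (2 and 4) in the other. In the Sarus workflow the analyst gets the same kind of split without ever seeing the underlying rows.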

Activate this list of ids

Now, we’re ready to send the list of IDs to a third-party tool.


And it’s done: all the insights have been pushed to the third-party tool, and the digital marketing team can instantly start using them!

Conclusion and benefits

The data scientist, armed with their trusty libraries, was able to run their analytical work in the usual way, even without accessing the data. They were able to define two specific audiences and use them in marketing campaigns. The process adhered to the highest standards of data protection, ensuring that no personal information was exposed or leaked.

About the author

Josselin Pomat

Customer success manager & Product


©2023 Sarus Technologies.
All rights reserved.