Carry out medical research on OMOP data with automated privacy guarantees

Case study: Sarus for Healthcare

Data science
Machine Learning
Health Care
Josselin Pomat

Carry out medical research without exposing patient data


The challenge: improve patient care using privacy-sensitive patient data

Hospitals collect fine-grained data about their patients to support medical treatment. This data is highly valuable for medical research, but also highly sensitive. It is legitimately subject to strict regulations designed to protect patient privacy, which make it hard, slow, and sometimes impossible to use for medical research.

In this blog post, we will show you how Sarus helps unlock patient data for research, without sacrificing patients’ privacy.

Dataset

In this demo, we are using a sample of an OMOP-like dataset consisting of three tables: patient information, treatment types, and medication origins. These tables contain information on 1,000 patients. The data is an extract from the CMS Synthetic Patient Data OMOP.

Getting access to such fine-grained information on patients is notoriously difficult. It typically requires months-long data compliance processes, possibly with formal approval from the data protection authority.

First rows of the person table

Find the data on our GitHub.

Context for the case study

A researcher specializing in heart diseases is given access to the dataset. They want to build a machine learning model to detect potential atrial fibrillation. The researcher will first explore the data, then build an adequate model.

Extract, explore, pre-process

Notebook intro

Find the notebook here.

The first lines of code import the Sarus libraries, load the data science and analytics APIs, and connect to the Sarus instance where the dataset was onboarded by the Data Administrator.

import sarus
import sarus.pandas as pd
import sarus.numpy as np

from sarus.sklearn.model_selection import train_test_split
from sarus.sklearn.ensemble import RandomForestClassifier
from sarus.sklearn.pipeline import Pipeline
from sarus.sklearn.impute import SimpleImputer
from sarus.sklearn.compose import ColumnTransformer
from sarus.sklearn.preprocessing import scale, OneHotEncoder
from sarus.sklearn.model_selection import cross_val_score

from sarus import Client

client = Client(url="https://admin.sarus.tech/gateway", email="analyst@example.com")

Explore the tables

Once the dataset has been selected, the researcher can browse its tables.

remote_dataset = client.dataset(slugname='patient_data_2')
remote_dataset.tables()
df = remote_dataset.table(['patient_data_2','private','dose_era_1000']).as_pandas()

Notice the message: ‘Evaluated from synthetic data only.’ The privacy policy prevents all row-level information from being retrieved, so the application will return synthetic data instead. You can check out this article to learn more about our synthetic data generation model.

Synthetic data enables the researcher to explore the underlying data and get familiar with it even if they cannot see the real records.
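
For instance, the researcher can inspect the returned DataFrame with the usual pandas methods; a minimal sketch, assuming the sarus.pandas wrapper mirrors the standard pandas API (as the imports above suggest):

print(df.head())      # first synthetic rows of the table
print(df.describe())  # summary statistics computed on the synthetic sample
print(df.dtypes)      # column types, useful to plan the preprocessing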

Extract the relevant data with a SQL query and process them with pandas

After exploring the different tables of the dataset, the researcher defines the extraction of interest using a SQL query.

extract = remote_dataset.sql(""" 
SELECT day_of_birth, month_of_birth, year_of_birth, 
    gender_concept_id, cond_occ.person_id, ethnicity_source_value, 
    race_source_value, gender_source_value,location_id, 
    ethnicity_concept_id, race_concept_id, person_source_value, 
    condition_source_value, condition_end_date, condition_start_date, 
    condition_concept_id
FROM patient_data_2.private.person_1000 AS patient
JOIN patient_data_2.private.condition_occurrence_1000_v2 AS cond_occ
ON patient.person_id = cond_occ.person_id
WHERE year_of_birth < 1942 """
)
df = extract.as_pandas()

Now that the relevant extract has been defined, the researcher can build the preprocessing pipeline needed to train a machine learning model.

# Choosing conditions corresponding to heart diseases
diabetes_conditions = [201826, 195771]
atrial_fibrillation_conditions = [313217]

# Small feature engineering
conditions = df.drop([
    'person_source_value',
    'condition_source_value',
    'condition_end_date',
    'condition_start_date'
    ], axis=1)

conditions['has_diabetes'] = conditions['cond_occ_person_id']\
      .isin(conditions.loc[conditions['condition_concept_id']\
      .isin(diabetes_conditions), 'cond_occ_person_id']).astype('int')

conditions['atrial_fibrillation'] = conditions['cond_occ_person_id']\
      .isin(conditions.loc[conditions['condition_concept_id']\
      .isin(atrial_fibrillation_conditions), 'cond_occ_person_id'])\
      .astype('int')

# Defining training and target (atrial fibrillation)
target_colname = 'atrial_fibrillation'
df_aggreg_features = conditions.drop(['condition_concept_id'], axis=1)\
      .drop_duplicates()
X = df_aggreg_features.drop([target_colname], axis=1)
y = df_aggreg_features[target_colname]

# Splitting into train & test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Then, the researcher builds a pipeline with scikit-learn, trains the model and checks its accuracy.

# Defining numeric features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns

# Building pipeline
# Fill in missing numeric values with the mean
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
    ])

# Set up the preprocessing workflow
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_features)
        ])

# Final pipe
pipe = Pipeline([
    ('preprocessing', preprocessor),
    ('model_forest', RandomForestClassifier())
])
# Training and evaluating the model
scores = cross_val_score(pipe, X_train, y_train, cv=5)
score_mean = np.mean(sarus.eval(scores))

print(score_mean)

Output:
Whitelisted
0.7876

Here the researcher trained a RandomForestClassifier using the scikit-learn library. Now, they will evaluate the model by comparing its accuracy with the base rate in the data.

Note the term ‘Whitelisted’ in the output. It means that the computation was run on the original data, without rewriting, by exception (unlike computations executed against synthetic data or rewritten to carry differential privacy guarantees). This comes from how the privacy policy was defined: it granted the researcher two exceptions so that they can use cross_val_score and pipe.score. Thanks to these exceptions, the outputs of both methods can be retrieved directly.
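
Since pipe.score is also whitelisted, the researcher can fit the pipeline and evaluate it on the held-out test set in the same way; a minimal sketch, assuming the whitelisted method's result can be retrieved with sarus.eval like the cross-validation scores above:

# Fitting on the training split, then using the whitelisted pipe.score
# exception to retrieve the accuracy on the held-out test set
pipe.fit(X_train, y_train)
test_score = sarus.eval(pipe.score(X_test, y_test))
print(test_score)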

# Checking the percentage of patients who suffered atrial fibrillation
r = client.query("""
SELECT COUNT(DISTINCT patient.person_id)
FROM patient_data_2.private.person_1000 AS patient
JOIN patient_data_2.private.condition_occurrence_1000_v2 AS cond_occ
ON patient.person_id = cond_occ.person_id
WHERE year_of_birth < 1942
AND condition_concept_id IN (313217)
                """
)

nb_atrial_fibrillation = pd.DataFrame(
      r['result'], 
      columns=r['columns']).values[0]

r = client.query("""
SELECT COUNT(DISTINCT patient.person_id)
FROM patient_data_2.private.person_1000 AS patient
JOIN patient_data_2.private.condition_occurrence_1000_v2 AS cond_occ
ON patient.person_id = cond_occ.person_id
WHERE year_of_birth < 1942"""
)

nb_patients = pd.DataFrame(r['result'], columns=r['columns']).values[0]

nb_atrial_fibrillation / nb_patients

Output:
array([0.6536]) 

The accuracy of the model was close to 0.78, while about 0.65 of the patients in the extract had atrial fibrillation: the model beats the naive baseline of always predicting the majority class by more than 0.12 points. If the researcher considers the model accurate enough, it can be exported for future use. Or they can keep exploring new modeling strategies, still without spending any time waiting for data access approvals.

# Extracting the model to put it in production
pipe_prod = sarus.eval(pipe)

import pickle
with open("model.pkl", "wb") as f:
    pickle.dump(pipe_prod, f)
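
Once exported, the pipeline is an ordinary scikit-learn object that can run anywhere; a minimal usage sketch, where new_patients.csv is a hypothetical file containing the same feature columns as X_train:

# Later, outside Sarus: load the exported pipeline and run predictions.
# "new_patients.csv" is a hypothetical file with the same feature
# columns as X_train.
import pickle
import pandas

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

new_patients = pandas.read_csv("new_patients.csv")
predictions = model.predict(new_patients)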

Conclusion

What were the benefits of using Sarus?

On the one hand, the hospital was able to make its data useful for research in a fully compliant way. The data was never copied or shared. The researcher never saw a single row of patient data. Every result was protected according to the privacy policy.

On the other hand, the researcher was able to leverage the data right away. They could explore it freely thanks to synthetic data. Preprocessing and training of the machine learning algorithm were very standard; there was no need to learn a new way of coding.

Overall, the process was more secure and faster than it would have been without Sarus, and the final machine learning model is just as robust!

Want to give it a try? Book a demo!

About the author

Josselin Pomat

Customer success manager & Product
