# A partnership about big data in health

## The partnership between CNAM and Ecole polytechnique

## Stéphane Gaïffas

---

---

# Who am I?

- Stéphane Gaïffas

- Professor

- Research and teaching in machine learning, big data and AI

- Laboratoire de Probabilités, Statistique et Modélisation

- [http://www.cmap.polytechnique.fr/~gaiffas/](http://www.cmap.polytechnique.fr/~gaiffas/)

- [stephane.gaiffas@lpsm.paris](mailto:stephane.gaiffas@lpsm.paris)

.center[
    <img src="figs/lpsm.png" style="height: 160px;" />
    <img src="" style="width: 30px;" />
    <img src="figs/paris-diderot.png" style="height: 90px;" />
    <img src="" style="width: 30px;" />
    <img src="figs/uspc.png" style="height: 120px;" />
]

---

# 1. Presentation of the project

## The team and the objectives

---

# The team

## Part time senior researchers

- E. Bacry (Univ. Paris Dauphine, Ecole polytechnique, CNRS)

- A. Guilloux (Univ. d'Evry, Univ. Paris Saclay)

- S. Gaïffas (Univ. Paris-Diderot, Ecole polytechnique)

- Many researchers in public health at CNAM (A. Weill, C. Gissot, etc.)

## Full time researchers and engineers

- Bio-statistics: F. Leroy (CNAMTS)

- Engineers: Y. Sebiat (X), D. Sun (X)

- Postdoc: Anastasiia Kabeshova

- PhD students: M. Morel and Y. Yu

---

# The project

## Research partnership for 3 years

- Renewed this year (2018 -- 2020)

## .stress[**Big data and AI**] on the SNIIRAM / PMSI data

- Pharmacovigilance

- Efficiency of health pathways

- Fraud detection, etc.

---

# The project

## While...

- SNIIRAM / PMSI is .stress[not designed] for this (at all !)

- It is the .stress[database used to reimburse French citizens] for health care

- .stress[Oriented towards health acts, not individuals]

- **Does not contain clinical data !**

## But...

- **Extremely rich**, contains a lot of information (a **LOT**!)

- One of the .stress[largest electronic health records databases in the world] (centralization of all reimbursements in France)

---

# 2. Infrastructure

## From a vertical to a horizontal infrastructure

---

# The original "vertical" infrastructure

## Hardware infrastructure

- Huge .stress[Oracle Exadata] servers with tons of RAM (and very expensive...)

## Data infrastructure

- Proprietary .stress[Oracle relational SQL] database, 800 tables, several terabytes

## Software infrastructure

- Proprietary .stress[SAS] software

## `$ \Rightarrow $` Completely inappropriate for new "big data" approaches

- Closed infrastructure that .stress[forbids all methodological research]
- Not appropriate for distributed data processing

---

# A new "horizontal" infrastructure

## Two big-data clusters (almost duplicated)

- One at X (for testing and development)
- One at CNAMTS (for deployment)
- NB: .stress[data cannot exit CNAMTS !] which is a **very strong constraint**

## Horizontal architecture

.pull-left-40[
- Scalable architecture: .stress[distributed data and processing]
- 4 masters
- 15 slaves
- 240 cores
- 1.9 TB RAM
- 480 TB (120 hard drives)
- .stress[Spark, Scala]
- .stress[Only open-source technology]
]

---

.pull-left[
- Data understanding
- Flattening
- Organization
- Distributed datasets
- Column-oriented format `parquet.apache.org`
- Open source
- Development in an "agile" mode
- Code organized in an ETL fashion

## A fundamental step: .stress[flattening]

- From a data format which is ideal for .stress[SQL queries] (random access)

- To a format which is ideal for .stress[batch processing]
]

.pull-right[
  .center[
  <img src="figs/flattening.png" style="height: 250px; padding-bottom: 30px;" />
  <img src="figs/agile.png" style="height: 300px;" />
  ]
]
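---

# Flattening: a toy sketch

The flattening step can be pictured with a few lines of plain Python (the production pipeline uses Spark and Scala). Table names, column names and values below are hypothetical and are not taken from SNIIRAM / PMSI. The sketch only shows a single up-front join that turns normalized tables into self-contained rows, which batch jobs can then scan sequentially.

```python
# Hypothetical normalized tables (names, columns and values are illustrative only).
patients = {1: {"birth_year": 1950, "gender": "F"},
            2: {"birth_year": 1962, "gender": "M"}}
drugs = {"A10": {"label": "antidiabetic"},
         "C07": {"label": "beta blocker"}}
deliveries = [
    {"patient_id": 1, "drug_code": "A10", "month": 3},
    {"patient_id": 1, "drug_code": "C07", "month": 5},
    {"patient_id": 2, "drug_code": "A10", "month": 4},
]


def flatten(deliveries, patients, drugs):
    """Join the central table with its dimension tables once, up front.

    Each output row is self-contained, so later batch jobs read rows
    sequentially instead of issuing random-access lookups.
    """
    rows = []
    for delivery in deliveries:
        row = dict(delivery)
        row.update(patients[delivery["patient_id"]])
        row.update(drugs[delivery["drug_code"]])
        rows.append(row)
    return rows


flat = flatten(deliveries, patients, drugs)
```

The flat rows are redundant (patient attributes are repeated), which is the usual trade-off: more storage in exchange for join-free batch processing.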

---

# Software architecture

## Featuring

- .stress[Transformation of the raw] data into .stress[a matrix] that can be fed to a machine learning algorithm 
- Developed in .stress[`Scala`]
- Using the .stress[`Apache Spark`] library for big data processing (distributed data and computations on the clusters)

## Machine Learning

- Use of the .stress[`R` statistical software] for "classical" models
- .stress[Homemade `tick` library] (`Python 3`, mostly coded in `C++11`) developed by our team

.center[
    <img src="figs/scala.jpg" style="height: 70px;" />
    <img src="" style="width: 80px;" />
    <img src="figs/spark.png" style="height: 90px;" />
    <img src="" style="width: 80px;" />
    <img src="figs/python.png" style="height: 90px;" />
]
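---

# Software architecture

A minimal sketch of the featuring step, written in plain Python rather than the Scala / Spark code that runs on the cluster. The event layout `(patient, drug, bucket)` and the function name `featurize` are assumptions made for illustration only.

```python
def featurize(events, n_patients, n_drugs, n_buckets):
    """Turn a list of purchase events into a dense indicator array.

    matrix[i][b][j] is 1 if patient i purchased drug j during time bucket b,
    and 0 otherwise.
    """
    matrix = [[[0] * n_drugs for _ in range(n_buckets)]
              for _ in range(n_patients)]
    for patient, drug, bucket in events:
        matrix[patient][bucket][drug] = 1
    return matrix


# (patient, drug, bucket) triples, all 0-based
events = [(0, 1, 2), (0, 1, 3), (1, 0, 0)]
X = featurize(events, n_patients=2, n_drugs=2, n_buckets=4)
```

In practice most entries are zero, so a sparse representation would be preferable to the dense lists used here.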

---

# The .stress[`tick`] library

Just type in a terminal
```bash
pip install tick
```

- https://x-datainitiative.github.io/tick

- `Python 3` and `C++11`

- .stress[Open-source] (BSD-3 License)

- .stress[Machine learning with time-oriented models] (health, finance, etc.). Point processes (Poisson, Hawkes), Survival analysis, Generalized linear models, sparsity, etc.

- Also a simulation and optimization toolbox

- .stress[Partnership with Intel] (use-cases on Intel processor prototypes, 180 cores)

- .medium[**Contributors welcome** !]

---

# 3. The pilot project

## A first project with known outcomes

---

# The pilot project

## Bladder cancer for patients with type 2 diabetes

- .stress[Pharmacovigilance]

- Development of a .stress[screening algorithm] assessing the adverse effects of several drugs

- `$ \neq $` .stress[hypothesis validation]

- Simplified cohort preparation

## Application

- Cohort: **citizens with type 2 diabetes**

- Adverse effect: **bladder cancer**

`$ \Rightarrow $ ` .stress[identification by screening of Pioglitazone] (removed from market in 2011)

---

# The pilot project

## Some figures

- .stress[2.5 million] individuals

- .stress[4 years history]

- 1.3 TB (`$ \simeq $` 260 GB per year)

- 2 billion lines (500 million lines per year)

- Database flattening `$ \simeq $` 40 minutes (since Spark 2.1)

- Featuring `$ \simeq $` 10 min

---

# The pilot project

## .stress[Results obtained by a "standard" algorithm]

## Cox regression (survival analysis) `$ \Rightarrow $ ` hypothesis validation

- Paper: *Pioglitazone and risk of bladder cancer among diabetic patients in France: a population-based cohort study*, A. Neumann, A. Weill, P. Ricordeau, J. P. Fagot, F. Alla, H. Allemand, Diabetologia, 2012.

## Aim

- .stress[Reproduce the results]

- In order to .stress[check that our completely new pipeline just works]...

- .stress[Model shaking]: stability of results with respect to **the cohort definition parameters**. Something that could not be done with the closed SAS / Oracle framework

`$ \Rightarrow $` **Recovery of previously known results**

---

# Parameters of the cohort construction

## 1. Cancer definition

- **Broad definition: .stress[diagnosis C67]**

- **Narrow definition: .stress[C67 + radiotherapy + chemotherapy + ...]**

## 2. Length of the "follow up" control

- **Patients with cancer diagnosed .stress[before or during an initial period of *A* months] are removed**

- Default value: 6 months
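---

# Parameters of the cohort construction

A minimal sketch of the follow-up filter, assuming months are counted from the start of follow-up and that a diagnosis falling exactly at month *A* is still excluded. The function name and the input layout are hypothetical.

```python
def remove_early_cases(cancer_month, a_months=6):
    """Return the ids of patients kept after the follow-up filter.

    cancer_month maps a patient id to the month of cancer diagnosis, counted
    from the start of follow-up (negative means before follow-up), or to
    None when the patient has no diagnosis.
    """
    return {pid for pid, month in cancer_month.items()
            if month is None or month > a_months}


cohort = {"p1": None, "p2": 3, "p3": 14, "p4": -2}
kept = remove_early_cases(cohort)
```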

---

# Parameters of the cohort construction

## 3. Minimum drug purchases

- **Exposed to a drug if .stress[purchased it *B* times in *N* months]**

- Default values: `$ B = 2 $` times and `$ N = 6$` months
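---

# Parameters of the cohort construction

A minimal sketch of the purchase rule, under two assumptions that the slides do not spell out: "in *N* months" is read as a span strictly shorter than *N* months, and exposure starts at the purchase that completes the criterion. The function name is hypothetical.

```python
def first_exposure_month(purchase_months, b=2, n=6):
    """Return the month at which the patient becomes exposed, or None.

    The patient is exposed as soon as b purchases fall within a window
    shorter than n months; exposure is dated at the last of these purchases.
    """
    months = sorted(purchase_months)
    for idx in range(b - 1, len(months)):
        if months[idx] - months[idx - b + 1] < n:
            return months[idx]
    return None


start = first_exposure_month([1, 10, 12])   # purchases at months 1, 10 and 12
```

Varying `b` and `n` in such a function is the kind of "model shaking" that the closed framework made impractical.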

---

# Parameters of the cohort construction

## 4. Exposure length

- **Patient is considered exposed to a drug .stress[until *C* months after the last purchase]**

- Default: 3 months

## 5. Censoring of patients that do not purchase antidiabetic drugs anymore

- **If no drug purchase for .stress[more than *D* months]: censoring**

- Default: yes
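---

# Parameters of the cohort construction

A minimal sketch of parameters *C* and *D*, assuming months are integers, that each patient has at least one purchase, and that censoring takes effect *D* months after the last purchase preceding the gap. Function names are hypothetical, and *D* has no default here because the slides do not give one.

```python
def exposure_end(purchase_months, c=3):
    """Exposure is assumed to last until c months after the last purchase."""
    return max(purchase_months) + c


def censoring_month(purchase_months, end_of_study, d):
    """Return the month at which the patient stops being followed.

    If more than d months pass without any purchase, the patient is censored
    d months after the purchase that precedes the gap; otherwise follow-up
    runs until the end of the study.
    """
    months = sorted(purchase_months) + [end_of_study]
    for previous, following in zip(months, months[1:]):
        if following - previous > d:
            return previous + d
    return end_of_study


end = exposure_end([0, 2, 4])
censored_at = censoring_month([0, 2, 4, 20], end_of_study=24, d=6)
```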

---

# Experiments with Cox PH

## Conclusion

- Cox PH results .stress[surprisingly robust] to cohort parameters *A, B, C, D*

- Pioglitazone has a .stress[stronger effect for narrow cancer definition]

---
class: center, middle, inverse

# 4. ConvSCCS: a new model for screening

## New results and methodology

---

# ConvSCCS: a new model for screening

## What is it about ?

- Self-controlled case-series (SCCS): .stress[only keep patients with the case] (bladder cancer)

- Patients are .stress[their own control]

- .stress[Vastly simplified cohort preparation]: need to define **only** the exposure definition

- .stress[Longitudinal model] (features and labels)

- Estimates longitudinally the .stress[impact of exposures on the probability to develop cancer]

---

# ConvSCCS: a new model for screening

## Two choices for types of exposures

## Let's start with the results !

---

# Methodology behind ConvSCCS

## Creation of a new model
  
- Based on the SCCS approach: .stress[Self-Controlled Case Series]

- Nothing new here (Farrington 1995) but...

## Corrected and improved
  
- .stress[Auto-regressive structure] of the features

- Dedicated .stress[penalisation techniques]

- .stress[Very efficient optimization] algorithms for training of the model

---

# Methodology behind ConvSCCS

## Consequences

- .stress[Fast: only a few minutes to train the model] (on pilot project, after flattening and featuring)

- .stress[Scalability]: scales to large cohorts on a single machine (10M patients, 10K features)

## Even more importantly...

- .stress[No alignment required on the patients' exposures]!

- Translation invariance thanks to convolutions

- This .stress[solves a major problem with SCCS models], which prior to this work were **limited to the study of a single drug at a time**

---

# Methodology behind ConvSCCS

## Setting
  
- We have .stress[individuals] `$ i=1, \ldots, n $`

- Time `$ [0, T] $` is .stress[partitioned in intervals] `$ I_b $` for `$b=1, \ldots, B$`

- We observe the .stress[number of adverse events] `$ y_{i, b} \in \mathbb N $`

- We put `$n_i = \sum_{b=1}^B y_{i, b}=$` .stress[number of adverse events of individual] `$i$`

- We observe .stress[longitudinal features] of individual `$i$` over time intervals `$b=1, \ldots, B$` (drug exposures, etc.)

- We observe .stress[static features] `$z_i = (z_i^1, \ldots, z_i^p) \in \mathbb R^p$` (gender, age if `$B$` is small, etc.)

---

# Methodology behind ConvSCCS

## Autoregressive features

Intensity of occurrence of adverse events at time `$b$` depends on feature `$j$` via:

.center[
`$\displaystyle
   \sum_{k=0}^{B-1} \theta_k^j x_{i, b}^{j, k}
$`
]

where:

- `$ \theta_k^j = $` **effect of feature `$j$` when exposure occurred `$k$` time intervals before the current one**

- `$ x_{i, b}^{j, k}= $` **exposure of individual `$i$` to drug `$j $` that occurred `$k$` intervals before interval `$b$`**

---

# Methodology behind ConvSCCS

## Autoregressive features

Intensity of occurrence of adverse events at time `$b$` depends on feature `$j$` via:

.center[
`$\displaystyle
   \sum_{k=0}^{B-1} \theta_k^j x_{i, b}^{j, k}
$`
]

## Leads to a .stress[translation-invariant parametrization] of the model

- .stress[No time realignment] is required

- .stress[Solves a big issue] in the SCCS literature (where only one type of exposure, i.e. a single molecule, is used!)

$$
\newcommand{\E}{\mathbb E}
\newcommand{\P}{\mathbb P}
\newcommand{\R}{\mathbb R}
\newcommand{\N}{\mathbb N}
\newcommand{\inr}[1]{\langle #1 \rangle}
\newcommand{\btheta}{\boldsymbol \theta}
\newcommand{\bX}{\boldsymbol X}
\newcommand{\ind}[1]{\mathbf 1_{#1}}
$$

---

# Methodology behind ConvSCCS

## Notation

- `$ b $` stands for the .stress[current index]

- `$ k $` stands for the .stress[lag]

- `$ x_{i, b}^{j, k} = 0 $` for any `$ k \geq b $`

We define the `$ d \times B $` matrix `$ \boldsymbol X_{i, b} $` with entries

.center[
`$\displaystyle
   (\boldsymbol X_{i, b})_{j, k} = x_{i, b}^{j, k}
$`
]

for `$ j=1, \ldots, d$`, `$ k=0, \ldots, B-1$`, `$ i=1, \ldots, n$` and `$ b=1, \ldots, B$`.
We also define

.center[
`$\displaystyle
   \inr{\btheta, \bX_{i, b}} = \sum_{j=1}^d \sum_{k=0}^{B-1}
  \theta_k^j x_{i, b}^{j, k}
$`
]
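---

# Methodology behind ConvSCCS

A minimal sketch of the lagged features and of the inner product, for a single patient and with 0-based indices (interval `b` and lag `k` both start at 0). It assumes that the lagged feature is simply the raw exposure shifted by `k` intervals; the function names are hypothetical.

```python
def lagged_features(exposure, b, n_lags):
    """Lagged feature matrix X[j][k] of one patient at interval b.

    exposure[j][t] is the raw exposure to drug j in interval t (0-based).
    X[j][k] = exposure[j][b - k], and 0 when the lag reaches before time 0.
    """
    return [[row[b - k] if b - k >= 0 else 0 for k in range(n_lags)]
            for row in exposure]


def inner(theta, X):
    """<theta, X> = sum over drugs j and lags k of theta[j][k] * X[j][k]."""
    return sum(t * x
               for theta_row, x_row in zip(theta, X)
               for t, x in zip(theta_row, x_row))


exposure = [[0, 1, 0, 0], [1, 0, 0, 0]]      # d = 2 drugs, B = 4 intervals
theta = [[0.0, 0.5, 0.2, 0.1], [0.0, 0.3, 0.4, 0.0]]
X = lagged_features(exposure, b=2, n_lags=4)

# Delaying every exposure and the current interval by one step leaves X unchanged.
delayed = [[0] + row for row in exposure]
X_delayed = lagged_features(delayed, b=3, n_lags=4)
```

The last two lines illustrate the translation invariance: only the delay between exposure and current interval matters, not the absolute dates.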

---

# Methodology behind ConvSCCS

## **Self-controlled case series** or .stress[conditional Poisson regression]

- Trick is to .stress[exploit the order statistics property of Poisson processes]

- Use a model on the conditional distribution of `$ (y_{i, 1}, 
  \ldots, y_{i, B}) | n_i$`, where `$ n_i = \sum_{b=1}^B y_{i, b}$`

The distribution of `$(y_{i, 1}, \ldots, y_{i, B}) $` conditionally on `$ (n_i, x_i) $` is

.center[
`$\displaystyle
 \text{Multinomial}\bigg( n_i, \frac{e^{\inr{\bX_{i, 1}, \btheta}}}{\sum_{b'=1}^B e^{\inr{\bX_{i, b'}, \btheta}}}, \ldots,
  \frac{e^{\inr{\bX_{i, B}, \btheta}}}{\sum_{b'=1}^B e^{\inr{\bX_{i, b'}, \btheta}}} \bigg)
$`
]

---

# Methodology behind ConvSCCS

Namely, we have that

.center[
`$\displaystyle
   \P(y_{i, 1}, \ldots, y_{i, B} | n_i, x_i) =
  \frac{n_i!}{\prod_{b=1}^B y_{i, b}!}
  \prod_{b=1}^B \Big(
  \frac{e^{\inr{\bX_{i, b}, \btheta}}}
  {\sum_{b'=1}^B e^{\inr{\bX_{i, b'}, \btheta}}} \Big)^{y_{i, b}}
$`
]

## Important remark

Constant effects (independent of `$ b $`, such as the `$ z_i $`) are .stress[killed by the conditioning] with respect to `$ n_i $`, since whenever
.center[
`$
   \lambda_{i, b} = e^{\inr{\bX_{i, b}, \btheta} + \beta^\top z_i + c_i},
$`
]
we have
.center[
`$\displaystyle
   \frac{\lambda_{i, b}}{\sum_{b'=1}^B \lambda_{i, b'}} 
    = \frac{e^{\inr{\bX_{i, b}, \btheta}}}
    {\sum_{b'=1}^B e^{\inr{\bX_{i, b'}, \btheta}}}.
$`
]
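---

# Methodology behind ConvSCCS

A minimal numerical sketch of the conditional multinomial model, for a single patient. The `scores` list stands for the values of the inner product over the intervals; the function names and the numbers are illustrative. Adding the same constant to every score (a static effect) leaves the conditional probabilities unchanged.

```python
import math


def conditional_probs(scores):
    """p_b = exp(s_b) / sum over b' of exp(s_b'), computed stably."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(y, scores):
    """log P(y_1, ..., y_B | n) under the multinomial with the probabilities above."""
    n = sum(y)
    probs = conditional_probs(scores)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in y)
    return log_coef + sum(c * math.log(p) for c, p in zip(y, probs))


scores = [0.0, 1.0, 0.5]
base = conditional_probs(scores)
# A constant effect (here 2.7, an arbitrary value) is added to every interval.
shifted = conditional_probs([s + 2.7 for s in scores])
```

Because `base` and `shifted` coincide up to rounding, static features cannot be estimated from this likelihood, and do not need to be.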

---

# Methodology behind ConvSCCS

## A dedicated penalization technique

- We want to consider a .stress[large number of lags] `$ k = 0, \ldots, B-1 $`, but we want to .stress[smooth the time-adjacent coefficients] `$ \theta_0^j, \ldots, \theta_{B-1}^j $`

- We use .stress[group total-variation penalization]

.center[
`$
  \displaystyle
  \gamma \sum_{j=1}^d \sum_{k=1}^{B-1} | \theta_{k}^j - \theta_{k-1}^j|
$`
]

- We can also add a group Lasso penalty, to cancel full groups
`$ \theta_{\bullet}^j = (\theta_0^j, \ldots, \theta_{B-1}^j ) $`

.center[
`$
  \displaystyle
  \gamma \sum_{j=1}^d \sum_{k=1}^{B-1} | \theta_{k}^j - \theta_{k-1}^j| + \tau \sum_{j=1}^d \| \theta_{\bullet}^j  \|_2,
$`
]

- where `$ \gamma, \tau > 0 $` are hyper-parameters and `$ \| \bullet \|_2 $` is the Euclidean norm
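---

# Methodology behind ConvSCCS

A minimal sketch that evaluates the penalty for given coefficients, with `theta[j][k]` the coefficient of drug `j` at lag `k`. The function name and the numerical values are illustrative.

```python
import math


def penalty(theta, gamma, tau):
    """Total-variation term plus group Lasso term.

    The total-variation part sums absolute differences between coefficients
    at adjacent lags, drug by drug; the group part sums the Euclidean norm
    of each drug's coefficient vector.
    """
    total_variation = sum(abs(row[k] - row[k - 1])
                          for row in theta
                          for k in range(1, len(row)))
    group_norms = sum(math.sqrt(sum(t * t for t in row)) for row in theta)
    return gamma * total_variation + tau * group_norms


theta = [[0.0, 0.0, 0.0],    # a drug with no effect: contributes nothing
         [0.0, 3.0, 4.0]]    # total variation 3 + 1 = 4, Euclidean norm 5
value = penalty(theta, gamma=0.5, tau=2.0)
```

A piecewise-constant effect pays the total-variation term only at its jumps, and a drug whose coefficients are all zero pays nothing, which is what drives the screening behaviour.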

---

# Methodology behind ConvSCCS

## Putting things together and assuming `$\bX_{i, b}$` contains exposure indicators

We minimize with respect to `$\btheta$`

.center[
`$
  \displaystyle
  - \frac 1n \sum_{i=1}^n \sum_{k=1}^K y_{ik}
   \log \Big( \displaystyle \frac{\lambda_{ik}(\phi, \btheta)}{\sum_{k'=1}^K 
   \lambda_{ik'}(\phi, \btheta)} \Big) +  \gamma \sum_{j=1}^d \sum_{k=1}^{p-1} 
   | \theta^j_{k+1} - \theta^j_{k} |
$`
]
where
.center[
`$\displaystyle
    \lambda_{ik}(\phi, \btheta) = \exp\Big(\phi_k + \sum_{j=1}^d \sum_{l=1}^{L_i^j} \theta_{k-c_{il}^j}^j \mathbf 1_{[0,p]} (k-c_{il}^j) \Big),
$`
]

with
- `$ \phi_k =$` age effect, and where we recall that `$ \theta_t^j =$` effect of exposure to drug `$ j $` after a period of `$ t$` months
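---

# Methodology behind ConvSCCS

A minimal sketch of the intensity and of the likelihood term for one patient, with 0-based intervals. It assumes that `starts[j]` lists the intervals at which the patient's exposures to drug `j` begin, which is one reading of the exposure start indices in the formula; function names and numbers are illustrative.

```python
import math


def log_intensity(k, phi, theta, starts, p):
    """log of lambda at interval k for one patient.

    phi[k] is the age effect; theta[j][t] is the effect of drug j at lag t,
    for t = 0..p; starts[j] lists the intervals where exposures to drug j begin.
    """
    value = phi[k]
    for j, drug_starts in enumerate(starts):
        for c in drug_starts:
            if 0 <= k - c <= p:          # indicator of the lag window [0, p]
                value += theta[j][k - c]
    return value


def neg_log_likelihood(y, phi, theta, starts, p):
    """Minus the sum over k of y_k * log(lambda_k / sum over k' of lambda_k')."""
    logs = [log_intensity(k, phi, theta, starts, p) for k in range(len(y))]
    top = max(logs)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logs))
    return -sum(count * (v - log_norm) for count, v in zip(y, logs))


phi = [0.0, 0.0, 0.0, 0.0]               # no age effect in this toy example
theta = [[0.0, math.log(2.0), 0.0]]      # one drug, doubled risk one interval after start
starts = [[1]]                           # a single exposure starting at interval 1
y = [0, 0, 1, 0]                         # one event, in interval 2
loss = neg_log_likelihood(y, phi, theta, starts, p=2)
```

Here the intensities are proportional to 1, 1, 2, 1, so the event interval has conditional probability 2/5 and the loss equals log(5/2).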

---

# Methodology behind ConvSCCS
  
## Tips and tricks

- Stratified `$ V $`-fold cross-validation to select the penalization strength `$ \gamma $`
- Very fast solver: SGD with variance reduction
- Fast proximal operator for total-variation
- Exploit sparsity of the `$ \bX_{i, b}$`

## Available tools

- Right-censoring
- Other types of featuring in `$\bX_{i, b} $` 
- Features product (joint exposures to drugs)
- Confidence intervals using parametric bootstrap

## Next generalizations

- Multi-task (many adverse events at the same time)
- Non-case series

---
class: center, middle, inverse

# 5. What's next and conclusions

---

# Ongoing work

## Cohort of "old" persons

- Drugs: .stress[large classes] (several dozens)
- .stress[Many drug interactions] (older patients have many exposures)
- Adverse effects: .stress[falls] (detected via fractures)

## Some figures

- .stress[12 million individuals] (compared to 2.5 million for the pilot project)
- `$\approx$` 1.6 TB per year (compared to 250 GB for the pilot project)
- 2 billion lines per year

## Some challenges

- Scalability of our pipeline
- Robustness of our model to short-term effects (the pilot project was all about long-term effects)

---

# Ongoing work

## Cohort of patients with prostatic surgery

- Partnership with the .stress[urology service at Tenon hospital] (Paris)
- PhD of Y. Yu
- .stress[All male patients in France] who had prostatic surgery
- About .stress[200K men], many features (not only drug exposures)
- .stress[Predict post-surgery complications] before the surgery
- A very different approach, based on .stress[deep learning methods]
- Direct use of the full .stress[pathways of patients, including all health events]

## Challenges

- Very .stress[irregular / sparse] longitudinal features
- Architecture based on .stress[irregular recurrent neural network] models
- Large number of medical codes (about 30K)

---

# Thank you !