class: center, middle, inverse

# A partnership about big data in health

## The partnership between CNAM and Ecole polytechnique

## Stéphane Gaïffas

---
layout: true
class: top

---

# Who am I?

- Stéphane Gaïffas
- Professor
- Research and teaching in machine learning, big data and AI
- Laboratoire de Probabilités, Statistique et Modélisation
- [http://www.cmap.polytechnique.fr/~gaiffas/](http://www.cmap.polytechnique.fr/~gaiffas/)
- [stephane.gaiffas@lpsm.paris](mailto:stephane.gaiffas@lpsm.paris)

.center[
]

---
class: center, middle, inverse

# 1. Presentation of the project

## The team and the objectives

---
# The team

## Part-time senior researchers

- E. Bacry (Univ. Paris Dauphine, Ecole polytechnique, CNRS)
- A. Guilloux (Univ. d'Evry, Univ. Paris Saclay)
- S. Gaïffas (Univ. Paris-Diderot, Ecole polytechnique)
- Many researchers in public health at CNAM (A. Weill, C. Gissot, etc.)

## Full-time researchers and engineers

- Bio-statistics: F. Leroy (CNAMTS)
- Engineers: Y. Sebiat (X), D. Sun (X)
- Postdoc: Anastasiia Kabeshova
- PhD students: M. Morel and Y. Yu

---
# The project

## Research partnership for 3 years

- Renewed this year (2018 -- 2020)

## .stress[**Big data and AI**] on the SNIIRAM / PMSI data

- Pharmacovigilance
- Efficiency of health pathways
- Fraud detection, etc.

---
# The project

## While...

- SNIIRAM / PMSI is .stress[not designed] for this (at all!)
- It is the .stress[database used to reimburse French citizens] for health care
- .stress[Oriented towards health acts, not individuals]
- **Does not contain clinical data!**

## But...

- **Extremely rich**, contains a lot of information (a **LOT**!)
- One of the .stress[largest electronic health records databases in the world] (centralization of all reimbursements in France)

---
class: center, middle, inverse

# 2. Infrastructure

## From a vertical to a horizontal infrastructure

---
# The original "vertical" infrastructure

## Hardware infrastructure

- Huge .stress[IBM Exadata] servers with tons of RAM (and very expensive...)
## Data infrastructure

- Proprietary .stress[Oracle relational SQL] database, 800 tables, several terabytes

## Software infrastructure

- Proprietary .stress[SAS] software

## `\( \Rightarrow \)` Completely inappropriate for new "big data" approaches

- Closed infrastructure that .stress[forbids all methodological research]
- Not appropriate for distributed data processing

---
# A new "horizontal" infrastructure

## Two big-data clusters (almost duplicated)

- One at X (for testing and development)
- One at CNAMTS (for deployment)
- NB: .stress[data cannot leave CNAMTS!] which is a **very strong constraint**

## Horizontal architecture

.pull-left-40[
- Scalable architecture: .stress[distributed data and processing]
- 4 masters
- 15 slaves
- 240 cores
- 1.9 TB RAM
- 480 TB (120 hard drives)
- .stress[Spark, Scala]
- .stress[Only open-source technology]
]

.pull-right-60[
]

---
.pull-left[
# Data preprocessing

- Data understanding
- Flattening
- Organization
- Distributed datasets
- Column-oriented format `parquet.apache.org`
- Open source
- Development in an "agile" mode
- Code organized in an ETL fashion

## A fundamental step: .stress[flattening]

- From a data format which is ideal for .stress[SQL queries] (random access)
- To a format which is ideal for .stress[batch processing]
]

.pull-right[
.center[
]
]

---
# Software architecture

## Featuring

- .stress[Transformation of the raw] data into .stress[a matrix] that can be fed to a machine learning algorithm
- Developed in .stress[`Scala`]
- Using the .stress[`Apache Spark`] library for big data processing (distributed data and computations on the clusters)

## Machine Learning

- Use of the .stress[`R` statistical software] for "classical" models
- .stress[Homemade `tick` library] (`Python 3`, mostly coded in `C++11`) developed by our team

.center[
]

---
# The .stress[`tick`] library

Just type in a terminal

```bash
pip install tick
```

- https://x-datainitiative.github.io/tick
- `Python 3` and `C++11`
- .stress[Open-source] (BSD-3 License)
- .stress[Machine learning with time-oriented models] (health, finance, etc.). Point processes (Poisson, Hawkes), survival analysis, generalized linear models, sparsity, etc.
- Also a simulation and optimization toolbox
- .stress[Partnership with Intel] (use-cases on Intel processor prototypes, 180 cores)
- .medium[**Contributors welcome!**]

---
# The .stress[`tick`] library

.center[
]

---
class: center, middle, inverse

# 3. The pilot project

## A first project with known outcomes

---
# The pilot project

## Bladder cancer for patients with type 2 diabetes

- .stress[Pharmacovigilance]
- Development of a .stress[screening algorithm] assessing the adverse effects of several drugs
- `\( \neq \)` .stress[hypothesis validation]
- Simplified cohort preparation

## Application

- Cohort: **citizens with type 2 diabetes**
- Adverse effect: **bladder cancer**

`\( \Rightarrow \)` .stress[identification by screening of Pioglitazone] (removed from market in 2011)

---
# The pilot project

## Some figures

- .stress[2.5 million] individuals
- .stress[4 years of history]
- 1.3 TB (`\( \simeq \)` 260 GB per year)
- 2 billion lines (500 million lines per year)
- Database flattening `\( \simeq \)` 40 minutes (since Spark 2.1)
- Featuring `\( \simeq \)` 10 minutes

---
# The pilot project

## .stress[Results obtained by a "standard" algorithm]

## Cox regression (survival analysis) `\( \Rightarrow \)` hypothesis validation

- Paper: *Pioglitazone and risk of bladder cancer among diabetic patients in France: a population-based cohort study*, A. Neumann, A. Weill, P. Ricordeau, J. P. Fagot, F. Alla, H. Allemand, Diabetologia, 2012.

## Aim

- .stress[Reproduce the results]
- In order to .stress[check that our completely new pipeline just works]...
- .stress[Model shaking]: stability of results with respect to **the cohort definition parameters**. Something that could not be done with the closed SAS / Oracle framework

`\( \Rightarrow \)` **Recovery of previously known results**

---
# Parameters of the cohort construction

## 1. Cancer definition

- **Broad definition: .stress[diagnosis C67]**
- **Narrow definition: .stress[C67 + radiotherapy + chemotherapy + ...]**

## 2. Length of the "follow up" control

- **Patients with cancer diagnosed .stress[before or during a period of *A* months] are removed**
- Default value: 6 months

.center[
]

---
# Parameters of the cohort construction

## 3. Minimum drug purchases

- **Exposed to a drug if .stress[purchased it *B* times in *N* months]**
- Default values: `\( B = 2 \)` times and `\( N = 6\)` months

.center[
]

---
# Parameters of the cohort construction

## 4. Exposure length

- **Patient remains exposed to a drug .stress[for *C* months after the last purchase]**
- Default: 3 months

.center[
]

## 5. Censoring of patients that do not purchase antidiabetic drugs anymore

- **If no drug purchase for .stress[more than *D* months]: censoring**
- Default: yes

---
# Experiments with Cox PH

.center[
]

---
# Experiments with Cox PH

.center[
]

---
# Experiments with Cox PH

.center[
]

---
# Experiments with Cox PH

.center[
]

---
# Experiments with Cox PH

## Conclusion

- Cox PH results are .stress[surprisingly robust] to cohort parameters *A, B, C, D*
- Pioglitazone has a .stress[stronger effect for the narrow cancer definition]

---
class: center, middle, inverse

# 4. ConvSCCS: a new model for screening

## New results and methodology

---
# ConvSCCS: a new model for screening

## What is it about?

- Self-controlled case series (SCCS): .stress[only keep patients with the case] (bladder cancer)
- Patients are .stress[their own control]
- .stress[Vastly simplified cohort preparation]: **only** the exposure needs to be defined
- .stress[Longitudinal model] (features and labels)
- Estimates longitudinally the .stress[impact of exposures on the probability of developing cancer]

---
# ConvSCCS: a new model for screening

## Two choices for types of exposures

.center[
]

## Let's start with the results!

---
# ConvSCCS: a new model for screening

.center[
]

---
# ConvSCCS: a new model for screening

.center[
]

---
# Methodology behind ConvSCCS

## Creation of a new model

- Based on the SCCS approach: .stress[Self-Controlled Case Series]
- Nothing new here (Farrington 1995) but...

## Corrected and improved

- .stress[Auto-regressive structure] of the features
- Dedicated .stress[penalization techniques]
- .stress[Very efficient optimization] algorithms for training the model

---
# Methodology behind ConvSCCS

## Consequences

- .stress[Fast: only a few minutes to train the model] (on the pilot project, after flattening and featuring)
- .stress[Scalability]: scales on a single machine with large cohorts (10M patients, 10K features)

## Even more importantly...

- .stress[No alignment required on the patients' exposures]!
- Translation invariance thanks to convolutions
- The .stress[solution to a major problem with SCCS models], which prior to this work were **limited to the study of a single drug at a time**

---
# Methodology behind ConvSCCS

## Setting

- We have .stress[individuals] `\( i=1, \ldots, n \)`
- Time `\( [0, T] \)` is .stress[partitioned into intervals] `\( I_b \)` for `\(b=1, \ldots, B\)`
- We observe the .stress[number of adverse events] `\( y_{i, b} \in \mathbb N \)`
- We put `\(n_i = \sum_{b=1}^B y_{i, b}=\)` .stress[number of adverse events of individual] `\(i\)`
- We observe .stress[longitudinal features] of individual `\(i\)`
.center[
`\(\displaystyle x_{i, b} = (x_{i, b}^1, \ldots, x_{i, b}^d) \in \mathbb R^d \)`
]
over time intervals `\(b=1, \ldots, B\)` (drug exposures, etc.)
- We observe .stress[static features] `\(z_i = (z_i^1, \ldots, z_i^p) \in \mathbb R^p\)` (gender, age if `\(B\)` is small, etc.)
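---
# Methodology behind ConvSCCS

## Setting: a toy layout in code

A minimal sketch of how the quantities of the setting could be stored, using plain Python lists. The variable names and the toy values are assumptions made for illustration only; they are not taken from the actual pipeline.

```python
# Hypothetical toy data: n = 2 individuals, B = 4 intervals, d = 2 drugs
# y[i][b]: number of adverse events of individual i in interval b
y = [
    [0, 0, 1, 0],
    [0, 1, 0, 1],
]

# n_events[i]: total number of adverse events of individual i (n_i)
n_events = [sum(row) for row in y]

# x[i][b][j]: longitudinal feature j (e.g. exposure to drug j)
# of individual i during interval b
x = [
    [[0, 0], [1, 0], [1, 0], [0, 0]],
    [[0, 1], [0, 1], [0, 0], [1, 0]],
]

# z[i]: static features of individual i (e.g. gender, age)
z = [
    [1.0, 67.0],
    [0.0, 72.0],
]

n, B, d = len(y), len(y[0]), len(x[0][0])
```

Only `y`, `n_events` and `x` enter the conditional model; the static features `z` are kept here to mirror the setting.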
---
# Methodology behind ConvSCCS

## Autoregressive features

Intensity of occurrence of adverse events at time `\(b\)` depends on feature `\(j\)` via:

.center[
`\(\displaystyle \sum_{k=0}^{K-1} \theta_k^j x_{i, b}^{j, k} \)`
]

where:

- `\( \theta_k^j = \)` **effect of feature `\(j\)` when exposure occurred `\(k\)` time intervals before the current one**
- `\( x_{i, b}^{j, k}= \)` **exposure of individual `\(i\)` to drug `\(j \)` that occurred `\(k\)` intervals before interval `\(b\)`**

---
# Methodology behind ConvSCCS

## Autoregressive features

Intensity of occurrence of adverse events at time `\(b\)` depends on feature `\(j\)` via:

.center[
`\(\displaystyle \sum_{k=0}^{K-1} \theta_k^j x_{i, b}^{j, k} \)`
]

## Leads to a .stress[translation-invariant parametrization] of the model

- .stress[No time realignment] is required
- .stress[Solves a big issue] in the SCCS literature (where only one type of exposure, i.e. a single molecule, is used!)

$$
\newcommand{\E}{\mathbb E}
\newcommand{\P}{\mathbb P}
\newcommand{\R}{\mathbb R}
\newcommand{\N}{\mathbb N}
\newcommand{\inr}[1]{\langle #1 \rangle}
\newcommand{\btheta}{\boldsymbol \theta}
\newcommand{\bX}{\boldsymbol X}
\newcommand{\ind}[1]{\mathbf 1_{#1}}
$$

---
# Methodology behind ConvSCCS

## Notation

- `\( b \)` stands for the .stress[current index]
- `\( k \)` stands for the .stress[lag]
- `\( x_{i, b}^{j, k} = 0 \)` for any `\( k \geq b \)`

We define the `\( d \times B \)` matrix `\( \boldsymbol X_{i, b} \)` with entries

.center[
`\(\displaystyle (\boldsymbol X_{i, b})_{j, k} = x_{i, b}^{j, k} \)`
]

for `\( j=1, \ldots, d\)`, `\( k=0, \ldots, B-1\)`, `\( i=1, \ldots, n\)` and `\( b=1, \ldots, B\)`.
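The lagged exposures above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the `tick` implementation: the function names `lagged_features` and `lag_effect` are hypothetical, intervals are 0-indexed, and `x_{i, b}^{j, k}` is read as the exposure to drug `j` in interval `b - k`, set to zero when that interval falls before the start of the observation period.

```python
def lagged_features(x, K):
    """Lagged exposures of one individual (hypothetical helper).

    x[b][j] is the exposure to drug j during interval b (0-indexed).
    Returns X with X[b][j][k] = x[b - k][j], and 0 when the lag k
    reaches before the first interval.
    """
    B, d = len(x), len(x[0])
    return [
        [[x[b - k][j] if b - k >= 0 else 0 for k in range(K)]
         for j in range(d)]
        for b in range(B)
    ]


def lag_effect(theta_j, X_b_j):
    """Contribution of one drug: sum over lags k of theta_j[k] * X_b_j[k]."""
    return sum(t * v for t, v in zip(theta_j, X_b_j))


# Toy example: one drug, exposure in the second interval only
x = [[0], [1], [0], [0]]
X = lagged_features(x, K=3)

# For each interval b, the exposure seen at lags k = 0, 1, 2
exposure_by_lag = [X[b][0] for b in range(len(x))]
# exposure_by_lag == [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Arbitrary toy coefficients theta_0, theta_1, theta_2 for this drug
theta = [0.5, 1.0, 2.0]
effects = [lag_effect(theta, X[b][0]) for b in range(len(x))]
# effects == [0, 0.5, 1.0, 2.0]
```

The exposure moves along the lag axis as `b` increases, so the same coefficients apply whatever the date of the exposure. Taking `K = B` gives the full `d × B` matrix of the notation above.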
We also define

.center[
`\(\displaystyle \inr{\btheta, \bX_{i, b}} = \sum_{j=1}^d \sum_{k=0}^{B-1} \theta_k^j x_{i, b}^{j, k} \)`
]

---
# Methodology behind ConvSCCS

## **Self-controlled case series** or .stress[conditional Poisson regression]

- The trick is to .stress[exploit the order statistics property of Poisson processes]
- Use a model on the conditional distribution of `\( (y_{i, 1}, \ldots, y_{i, B}) | n_i\)`, where `\( n_i = \sum_{b=1}^B y_{i, b}\)`

The distribution of `\((y_{i, 1}, \ldots, y_{i, B}) \)` conditionally on `\( (n_i, x_i) \)` is

.center[
`\(\displaystyle \text{Multinomial}\bigg( n_i, \frac{e^{\inr{\btheta, \bX_{i, 1}}}}{\sum_{b'=1}^B e^{\inr{\btheta, \bX_{i, b'}}}}, \ldots, \frac{e^{\inr{\btheta, \bX_{i, B}}}}{\sum_{b'=1}^B e^{\inr{\btheta, \bX_{i, b'}}}} \bigg) \)`
]

---
# Methodology behind ConvSCCS

Namely, we have that

.center[
`\(\displaystyle \P(y_{i, 1}, \ldots, y_{i, B} | n_i, x_i) = \frac{n_i!}{\prod_{b=1}^B y_{i, b}!} \prod_{b=1}^B \Big( \frac{e^{\inr{\btheta, \bX_{i, b}}}} {\sum_{b'=1}^B e^{\inr{\btheta, \bX_{i, b'}}}} \Big)^{y_{i, b}} \)`
]

## Important remark

Constant effects (independent of `\( b \)`, such as the `\( z_i \)`) are .stress[killed by the conditioning] with respect to `\( n_i \)`, since whenever

.center[
`\( \lambda_{i, b} = e^{\inr{\btheta, \bX_{i, b}} + \beta^\top z_i + c_i}, \)`
]

we have

.center[
`\(\displaystyle \frac{\lambda_{i, b}}{\sum_{b'=1}^B \lambda_{i, b'}} = \frac{e^{\inr{\btheta, \bX_{i, b}}}} {\sum_{b'=1}^B e^{\inr{\btheta, \bX_{i, b'}}}}.
\)`
]

---
# Methodology behind ConvSCCS

## A dedicated penalization technique

- We want to consider a .stress[large number of lags], but we want to .stress[smooth the time-adjacent coefficients] `\( \theta_0^j, \ldots, \theta_{B-1}^j \)`
- We use .stress[group total-variation penalization]
.center[
`\( \displaystyle \gamma \sum_{j=1}^d \sum_{k=1}^{B-1} | \theta_{k}^j - \theta_{k-1}^j| \)`
]
- We can also use group Lasso, to cancel full groups `\( \theta_{\bullet}^j = (\theta_0^j, \ldots, \theta_{B-1}^j ) \)`
.center[
`\( \displaystyle \gamma \sum_{j=1}^d \sum_{k=1}^{B-1} | \theta_{k}^j - \theta_{k-1}^j| + \tau \sum_{j=1}^d \| \theta_{\bullet}^j \|_2, \)`
]
- where `\( \gamma, \tau > 0 \)` are hyper-parameters and `\( \| \bullet \|_2 \)` is the Euclidean norm

---
# Methodology behind ConvSCCS

## Putting things together and assuming `\(\bX_{i, b}\)` contains exposure indicators

We minimize, with respect to `\(\btheta\)`, the penalized negative log-likelihood

.center[
`\( \displaystyle -\frac 1n \sum_{i=1}^n \sum_{k=1}^K y_{ik} \log \Big( \displaystyle \frac{\lambda_{ik}(\phi, \btheta)}{\sum_{k'=1}^K \lambda_{ik'}(\phi, \btheta)} \Big) + \gamma \sum_{j=1}^d \sum_{k=1}^{p-1} | \theta^j_{k+1} - \theta^j_{k} | \)`
]

where `\( k \)` now indexes the time intervals and

.center[
`\(\displaystyle \lambda_{ik}(\phi, \btheta) = \exp\Big(\phi_k + \sum_{j=1}^d \sum_{l=1}^{L_i^j} \theta_{k-c_{il}^j}^j \mathbf 1_{[0,p]} (k-c_{il}^j) \Big), \)`
]

with

- `\( \phi_k =\)` age effect, and where we recall that `\( \theta_t^j =\)` effect of exposure to drug `\( j \)` after a period of `\( t\)` months

---
# Methodology behind ConvSCCS

## Tips and tricks

- Stratified `\( V \)`-fold cross-validation for the hyper-parameters `\( \gamma \)` and `\( \tau \)`
- Very fast solver: SGD with variance reduction
- Fast proximal operator for total-variation
- Exploit sparsity of the `\( \bX_{i, b}\)`

## Available tools

- Right-censoring
- Other types of featuring in `\(\bX_{i, b} \)`
- Feature products (joint exposures to drugs)
- Confidence intervals using parametric bootstrap

## Next generalizations

- Multi-task
(many adverse events at the same time)
- Non-case series

---
class: center, middle, inverse

# 5. What's next and conclusions

---
# Ongoing work

## Cohort of elderly people

- Drugs: .stress[large classes] (several dozens)
- .stress[Many drug interactions] (elderly people have many exposures)
- Adverse effects: .stress[falls] (detected via fractures)

## Some figures

- .stress[12 million individuals] (compared to 2.5 million for the pilot project)
- `\(\approx\)` 1.6 TB per year (compared to 250 GB for the pilot project)
- 2 billion lines per year

## Some challenges

- Scalability of our pipeline
- Robustness of our model to short-term effects (the pilot project was all about long-term effects)

---
# Ongoing work

## Cohort of patients with prostate surgery

- Partnership with the .stress[urology department at Tenon hospital] (Paris)
- PhD of Y. Yu
- .stress[All male patients in France] who had prostate surgery
- About .stress[200K men], many features (not only drug exposures)
- .stress[Predict post-surgery complications] before the surgery
- A very different approach, based on .stress[deep learning methods]
- Direct use of the full .stress[pathways of patients, including all health events]

## Challenges

- Very .stress[irregular / sparse] longitudinal features
- Architecture based on .stress[irregular recurrent neural network] models
- Large number of medical codes (about 30K)

---
class: center, middle, inverse

# Thank you!