class: center, middle

# Deep Learning for Electronic Health Records

## Deep Learning in finance workshop - LPSM

.medium[Stéphane Gaïffas]

.medium[(with E. Bacry, I. Merad, M. Morel, A. Nitavskyi, Y. Yu)]
.center[
]

---

# Disclaimer
.large[.center[No finance in this talk!]]
.center[
]

---

class: center, middle, inverse

# 1. Deep learning in healthcare

---

# Deep learning in healthcare

Is **mostly** about .stress[computer vision], .stress[natural language processing] or .stress[omics]

.center[
] --- # Deep learning in healthcare: CV .center[
] .center[.tiny[From: "Dermatologist-level classification of skin cancer with deep neural networks" (https://www.nature.com/articles/nature21056.epdf)]] --- # Deep learning in healthcare: CV .center[
] .center[.tiny[From: "Dermatologist-level classification of skin cancer with deep neural networks" (https://www.nature.com/articles/nature21056.epdf)]] --- # Deep learning in healthcare: CV .center[
] .center[.tiny[From: "Dermatologist-level classification of skin cancer with deep neural networks" (https://www.nature.com/articles/nature21056.epdf)]] --- # Deep learning in healthcare: CV .center[
] .center[.tiny[From: "Clinically applicable deep learning for diagnosis and referral in retinal disease" (https://www.nature.com/articles/s41591-018-0107-6?sf195443527=1)]] --- # Deep learning in healthcare: CV .center[
] .center[.tiny[From: "Clinically applicable deep learning for diagnosis and referral in retinal disease" (https://www.nature.com/articles/s41591-018-0107-6?sf195443527=1)]] --- # Deep learning in healthcare: NLP To cite but a **few**... - Chen L. et al, *Clinical trial cohort selection based on multi-level rule-based natural language processing system.*, J Am Med Inform Assoc. 2019 - Wang Y. et al, *Clinical information extraction applications: a literature review.* J Biomed Inform. 2018; 77: 34-49" - Sohn S., *Ascertainment of asthma prognosis using natural language processing from electronic medical records*. J Allergy Clin Immunol. 2018 - Sohn S. et al., *Analysis of clinical variations in asthma care documented in electronic health records between staff and resident physicians*. Stud Health Technol Inform. 2017 - Liang H. et al., *Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence*, Nat Med. 2019 --- class: center, middle, inverse # 2. DL in healthcare: beyond CV and NLP --- # Deep learning for EHRs **Beyond** .stress[computer vision] and .stress[natural language processing] ?
.center[
] .center[.tiny[From: "Benchmarking deep learning models on large healthcare datasets" (https://www.sciencedirect.com/science/article/pii/S1532046418300716)]] --- # Deep learning for EHRs What is the .stress[state-of-the-art] ? .center[
] .center[.tiny[From: "Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data" (https://arxiv.org/pdf/2010.01149.pdf)]] --- # Deep learning for EHRs What is the .stress[state-of-the-art] ? .center[
] .center[.tiny[From: "Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data" (https://arxiv.org/pdf/2010.01149.pdf)]] --- # Deep learning for EHRs What is the .stress[state-of-the-art] ? .center[
]
.center[.tiny[From: "Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data" (https://arxiv.org/pdf/2010.01149.pdf)]] --- class: center, middle, inverse # 3. Self-advertisement ## Some of my projects --- # The X-CNAM project - CNAM = .stress[Caisse Primaire d'Assurance Maladie] - With E. Bacry (CNRS, Univ. Paris Dauphine) - A partnership between X and CNAM, ends this year (2017-2020) - Data from **carte vitale** (almost all French citizens, many TeraBytes) .center[
]

---

# The X-CNAM project

- CNAM = .stress[Caisse Nationale de l'Assurance Maladie]
- With E. Bacry (CNRS, Univ. Paris Dauphine)
- A partnership between X and CNAM, ends this year (2017-2020)
- Data from the **carte vitale** (almost all French citizens, many terabytes)

### Papers

- **SCALPEL3: a scalable open-source library for healthcare claims databases.** *International Journal of Medical Informatics*, 2020
- **Screening anxiolytics, hypnotics, antidepressants and neuroleptics for bone fracture risk among elderly: a nation-wide dynamic multivariate self-control study using the SNDS claims database**, *submitted to Drugs Safety*, 2020
- **ConvSCCS: convolutional self-controlled case series model for lagged adverse event detection**, *Biostatistics*, 2019

---

# Prairie 3IA Chair "PERHAPS"

- **PERHAPS** = deeP learning for ElectRonic HeAlth records and applications in ProStatic pathologies
- **Syndrome Métabolique et Pathologies Prostatiques** (Metabolic Syndrome and Prostatic Pathologies, with urologists from Tenon Hospital)

.center[
]

---

# Prairie 3IA Chair "PERHAPS"

- **PERHAPS** = deeP learning for ElectRonic HeAlth records and applications in ProStatic pathologies
- **Syndrome Métabolique et Pathologies Prostatiques** (Metabolic Syndrome and Prostatic Pathologies, with urologists from Tenon Hospital)

### Papers

- **ZiMM: a deep learning model for long term adverse events with non-clinical claims data**, *NeurIPS 2019, ML4H workshop* and *Journal of Biomedical Informatics*, 2020
- **Which attention model and unsupervised pretraining strategy for electronic health records?** *submitted*
- **About contrastive unsupervised representation learning for classification and its convergence.** *submitted*

---

# Oscour

- Partnership with E. Bacry and **emergency doctors** from AP-HP
- **Oscour database**: all arrivals at emergency services in Ile-de-France
- Aim: .stress[predict future hospitalizations]
- We are still waiting for **recent data** (including COVID-19...)
- And waiting for data from the **whole of France**
- An ongoing work

.center[
]

---

# ANR LabCOM with Califrais

- .stress[ANR LabCOM] ("laboratoire commun", a joint academia-industry lab) obtained one month ago
- **LOPF project** (Large-scale Optimization of Product Flows)
- With the company **Califrais**

### Topics of research

- Large-scale modeling of price evolution of products
- Multi-site stock optimization through reinforcement learning
- Churn modeling and prediction using online methods
- Recommender systems for large-scale products/clients graph data

.center[
]

---

# Some general-purpose DL techniques

In this talk I will describe .stress[very recent DL techniques] that can be **very useful beyond vision / NLP / healthcare applications**

**Architectures** involving .stress[attention mechanisms]

- *Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention* (https://arxiv.org/abs/2006.16236)
- *Rethinking Attention with Performers* (https://arxiv.org/abs/2009.14794)

**Self-supervised learning** based on .stress[contrastive learning]

- *A Simple Framework for Contrastive Learning of Visual Representations* (https://arxiv.org/abs/2002.05709)
- *Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning* (https://arxiv.org/abs/2006.07733)

`$$
\newcommand{\R}{\mathbb R}
\newcommand{\cY}{\mathcal Y}
\newcommand{\cJ}{\mathcal J}
\newcommand{\bX}{\boldsymbol X}
\newcommand{\XB}{\boldsymbol X^B}
\newcommand{\TV}{\text{TV}}
\newcommand{\norm}[1]{\| #1 \|}
\newcommand{\inr}[1]{\langle #1 \rangle}
\newcommand{\ind}[1]{\mathbf 1_{#1}}
\DeclareMathOperator{\argmin}{argmin}
\DeclareMathOperator{\pen}{pen}
\DeclareMathOperator{\prox}{prox}
\DeclareMathOperator{\bina}{bina}
$$`

---

class: center, middle, inverse

# 4. Attention-based architectures

---

# From RNNs to transformers

- RNNs used to be the .stress[state of the art] for **machine translation**, **time series analysis**, and more generally any **sequence-to-sequence task**
- Then, **attention** was used .stress[inside recurrent layers] to improve how they capture **long-range dependencies**
- But RNNs are **hard to scale**: their **recurrent** nature .stress[hinders distributed computations]

A .stress[game changer] came in 2017:

- **Attention is all you need** by Vaswani et al. (2017) introduces the .stress[transformer] architecture (https://arxiv.org/abs/1706.03762)
- Many follow-ups since then...
- The core ingredient is the .stress[Multi-Head Self-Attention] layer

Led to things like

- BERT, Transformer-XL, GPT-3 (175 billion parameters!)

---

# GPT-3 examples
.center[ .small[https://app.inferkit.com/demo]
]

---

# Self-attention layer

- For the first layer, the **input** is a sequence of .stress[token embeddings]
$$\mathbf{X} = [x_1, \dots , x_L]$$
where $x_i \in \R^d$
- The **output** is a same-length sequence of (hopefully) .stress[contextualized embeddings]
- For the other layers, the **input** is a sequence of **contextualized embedding vectors** (the output of a previous self-attention layer)

The next displays are from

- http://jalammar.github.io/illustrated-transformer

---

# Self-attention layer

It first computes .stress[keys], .stress[queries] and .stress[values]:

`$$
\mathbf{Q} = \mathbf{X W}^Q, \quad \mathbf{K} = \mathbf{X W}^K, \quad \mathbf{V} = \mathbf{X W}^V
$$`

where

`$$
\mathbf{W}^Q \in \R^{d \times d_k}, \quad \mathbf{W}^K \in \R^{d \times d_k} \quad \text{and} \quad \mathbf{W}^V \in \R^{d \times d_v}
$$`

are learned parameters and $\mathbf{X} \in \R^{L \times d}$ is the input

.center[
]

---

# Self-attention layer

It then computes .stress[inner products] between **queries** and **keys** and applies a .stress[softmax over rows]:

`$$
\mathbf{Z} = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d_k}} \right) \mathbf{V}
$$`

.center[
]
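
---

# Self-attention in code

A minimal PyTorch sketch of the two previous slides; shapes and names are illustrative, not taken from a reference implementation:

```python
import math

import torch

L, d, d_k, d_v = 10, 64, 32, 32            # sequence length and dimensions
X = torch.randn(L, d)                      # input token embeddings

# Learned projections (random here, trained in practice)
W_Q, W_K, W_V = torch.randn(d, d_k), torch.randn(d, d_k), torch.randn(d, d_v)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # queries, keys, values

# Scaled dot products between queries and keys, softmax over rows
A = torch.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)   # (L, L) attention matrix
Z = A @ V                                  # (L, d_v) contextualized embeddings
```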

---

# Multi-head self-attention layer

Combines $H$ .stress[heads] of **self-attention**

`$$
\begin{align*}
\mathbf{Q}_h &= \mathbf{X W}_h^Q, \quad \mathbf{K}_h = \mathbf{X W}_h^K, \quad \mathbf{V}_h = \mathbf{X W}_h^V \\
\mathbf{Z}_h &= \text{softmax}\left( \frac{\mathbf{Q}_h \mathbf{K}_h^\top}{\sqrt{d_k}} \right) \mathbf{V}_h \\
\text{MSA}(\mathbf{X}) &= [\mathbf{Z}_1 \; \cdots \; \mathbf{Z}_H] \; \mathbf{W}^O
\end{align*}
$$`

.center[
]
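
---

# Multi-head self-attention in code

Continuing the sketch, one straightforward (illustrative, loop-based) way to combine $H$ heads:

```python
import math

import torch

H, L, d, d_k, d_v = 4, 10, 64, 16, 16
X = torch.randn(L, d)
W_Q, W_K = torch.randn(H, d, d_k), torch.randn(H, d, d_k)  # one projection per head
W_V = torch.randn(H, d, d_v)
W_O = torch.randn(H * d_v, d)              # output projection

Z_heads = []
for h in range(H):
    Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]
    A = torch.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)
    Z_heads.append(A @ V)                  # Z_h, of shape (L, d_v)

MSA = torch.cat(Z_heads, dim=-1) @ W_O     # concatenate heads, project to (L, d)
```

Real implementations batch the heads into a single tensor operation instead of a Python loop.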

---

# Multi-head self-attention layer

**Visualization** of the .stress[softmax matrix]

.center[
] .center[
]

.center[.tiny[From https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html]]

Solves, among many other things, **coreference resolution** (a difficult problem in machine translation)

---

# Transformer architecture

The .stress[encoder] stacks several MSA layers as follows:

`$$
\begin{align*}
\mathbf{Y}_k &= \text{LayerNorm}(\mathbf{X}_k + \text{MSA}(\mathbf{X}_k)) \\
\mathbf{X}_{k+1} &= \text{LayerNorm}(\mathbf{Y}_k + \text{FF}(\mathbf{Y}_k))
\end{align*}
$$`

where

- $\text{FF}$ is a **dense layer** (feed-forward)
- $\mathbf{X}_1 = \mathbf{e}$, the sequence of $L$ token embeddings
- $\mathbf{X}_k \in \R^{L \times d_k}$ is the input of the $k$-th layer
- $\mathbf{X}_{k+1} \in \R^{L \times d_k}$ is the output of the $k$-th layer

(a code sketch follows after the next slide)

---

# Transformer architecture

.stress[Layer normalization] versus **batch normalization**

.center[
]
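
---

# Encoder block in code

A sketch of the encoder equations two slides back using PyTorch modules (simplified: real implementations also add dropout):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, X):
        # Y_k = LayerNorm(X_k + MSA(X_k))
        Y = self.norm1(X + self.msa(X, X, X, need_weights=False)[0])
        # X_{k+1} = LayerNorm(Y_k + FF(Y_k))
        return self.norm2(Y + self.ff(Y))

X = torch.randn(8, 10, 64)                 # (batch, L, d_model)
out = EncoderBlock()(X)                    # same shape as the input
```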

---

# Transformer architecture

Usually uses an .stress[encoder / decoder architecture]

.center[
]

---

# Transformer architecture

Usually uses an .stress[encoder / decoder architecture]

.center[
]

.center[.tiny[From https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html]]

---

# Positional embeddings

- As such, **token embeddings** do not change with their .stress[position] in the sequence
- A remedy is .stress[positional embeddings]: either **deterministic** or **trained**
- Just add each **positional embedding** to its **token embedding** before pushing the tensor through the architecture
- The original implementation uses **512-dimensional** embeddings built from **cosines and sines** (a sketch follows on the next slide)
- Example with only 2 cosines and sines:

.center[
]
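
---

# Positional embeddings in code

A sketch of deterministic sinusoidal positional embeddings in the spirit of the original implementation (dimensions are illustrative):

```python
import numpy as np

def positional_embeddings(L, d):
    """Sines on even dimensions, cosines on odd ones."""
    pos = np.arange(L)[:, None]            # positions 0, ..., L - 1
    i = np.arange(d // 2)[None, :]         # frequency indices
    angles = pos / 10000 ** (2 * i / d)    # (L, d / 2)
    P = np.zeros((L, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

# Added to the token embeddings before the first layer
X = np.random.randn(10, 512)               # L = 10 token embeddings
X = X + positional_embeddings(10, 512)
```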

---

# Problem: quadratic complexity of MSA

- The MSA layer has .stress[memory and computational complexity] $O(L^2 d)$
- Huge demand on computational power; it saturates GPU memory for **long sequences** ($L$ large)
- Some **follow-up works** propose strategies to solve this
.center[
]

---

# Problem: quadratic complexity of MSA

- The MSA layer has .stress[memory and computational complexity] $O(L^2 d)$
- Huge demand on computational power; it saturates GPU memory for **long sequences** ($L$ large)
- Some **follow-up works** propose strategies to solve this

**Graph Neural Networks** and **Graph Attention Networks**

- *Graph Attention Networks* (https://arxiv.org/abs/1710.10903)
- *Graph Neural Networks: A Review of Methods and Applications* (https://arxiv.org/abs/1812.08434)

--

**Linear transformers**

- *Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention* (https://arxiv.org/abs/2006.16236)
- *Rethinking Attention with Performers* (https://arxiv.org/abs/2009.14794)

---

# Problem: quadratic complexity of MSA

The **bottleneck** is the computation of the .stress[softmax attention]

`$$
\mathbf{Z} = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d}} \right) \mathbf{V}
$$`

that we can rewrite more generally as

`$$
Z_i = \frac{\sum_{j=1}^L \text{sim}(Q_i, K_j) V_j}{\sum_{j=1}^L \text{sim}(Q_i, K_j)}
$$`

for `$i=1, \ldots, L$` where

`$$
\text{sim}(q, k) = \exp\left( \frac{q^\top k}{\sqrt d} \right)
$$`

---

# Linear transformers

- https://arxiv.org/abs/2006.16236 uses a .stress[kernel trick]
- Replace `$\text{sim}(q, k)$` by

`$$
\text{sim}(q, k) = \phi(q)^\top \phi(k)
$$`

for a **feature mapping** $\phi$

- And consider in practice just a **simple activation function**

`$$
\phi(z) = \text{elu}(z) + 1
$$`

where `$\text{elu}(z) = z$` if `$z > 0$` and `$\text{elu}(z) = \alpha (e^z - 1)$` if `$z \leq 0$`

- **Solves** the .stress[memory] and .stress[computational bottlenecks] because of

`$$
Z_i = \frac{\sum_{j=1}^L \phi(Q_i)^\top \phi(K_j) V_j}{\sum_{j=1}^L \phi(Q_i)^\top \phi(K_j)} = \frac{\phi(Q_i)^\top \sum_{j=1}^L \phi(K_j) V_j}{\phi(Q_i)^\top \sum_{j=1}^L \phi(K_j)}
$$`

- No need to .stress[compute explicitly] the **attention matrix** anymore! (a code sketch follows two slides ahead)

---

# Linear transformers

- https://arxiv.org/abs/2009.14794 uses the same kernel trick
- But uses .stress[random projections] (**positive random features** for softmax)
- The trick relies on the following simple remark:

`$$
\text{sim}(q, k) = e^{q^\top k} = e^{\frac 12 \| q \|^2} K(q, k) e^{\frac 12 \| k \|^2 }
$$`

where `$K(q, k) = e^{- \frac 12 \| q - k \|^2}$` is the .stress[Gaussian kernel] so that

`$$
\begin{align*}
\text{sim}(q, k) &= e^{-\frac 12 (\| q \|^2 + \| k \|^2)} \; \mathbb E_{\omega \sim \mathcal N(0, \mathbf I_d)} \left[ e^{\omega^\top (q + k)} \right] \\
&\approx e^{-\frac 12 (\| q \|^2 + \| k \|^2)} \; \frac 1M \sum_{m=1}^M e^{\omega_m^\top (q + k)}
\end{align*}
$$`

with `$\omega_1, \ldots, \omega_M$` i.i.d. `$\mathcal N(0, \mathbf I_d)$`
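
---

# Linear attention in code

A sketch of the linearized (non-causal) attention with the `$\phi(z) = \text{elu}(z) + 1$` feature map; shapes are illustrative:

```python
import torch
import torch.nn.functional as F

L, d_k, d_v = 1000, 32, 32
Q, K, V = torch.randn(L, d_k), torch.randn(L, d_k), torch.randn(L, d_v)

phi = lambda x: F.elu(x) + 1               # positive feature map

# The sums over j are computed once: O(L) instead of O(L^2)
S = phi(K).T @ V                           # (d_k, d_v) = sum_j phi(K_j) V_j^T
s = phi(K).sum(dim=0)                      # (d_k,)     = sum_j phi(K_j)

Z = (phi(Q) @ S) / (phi(Q) @ s)[:, None]   # (L, d_v), no (L, L) matrix formed
```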

---

# Linear transformers

- https://arxiv.org/abs/2009.14794

.center[
]
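
---

# Positive random features in code

A sketch of the random-feature approximation from the previous slides; in practice the paper also rescales queries and keys and orthogonalizes the projections `$\omega_m$`, which is skipped here:

```python
import torch

def softmax_features(x, omega):
    # phi(x)_m = exp(omega_m^T x - ||x||^2 / 2) / sqrt(M), positive by construction
    M = omega.shape[0]
    return torch.exp(x @ omega.T - (x ** 2).sum(-1, keepdim=True) / 2) / M ** 0.5

L, d, M = 1000, 32, 128
Q, K, V = torch.randn(L, d), torch.randn(L, d), torch.randn(L, d)
omega = torch.randn(M, d)                  # i.i.d. N(0, I_d) projections

phi_Q, phi_K = softmax_features(Q, omega), softmax_features(K, omega)

# Same linear-attention computation as before, with random features
Z = (phi_Q @ (phi_K.T @ V)) / (phi_Q @ phi_K.sum(dim=0))[:, None]
```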

---

class: center, middle, inverse

# 5. Application to health sequences

---

# Application to health sequences

We can use either a .stress[sequence] or a .stress[graph] structure

.center[
]

---

# Application to health sequences

We use .stress[encoder-only] architectures

.center[
]

---

# Application to health sequences

- **ZiMM: a deep learning model for long term adverse events with non-clinical claims data**, *NeurIPS 2019, ML4H workshop* and *Journal of Biomedical Informatics*, 2020

.center[
]

---

class: center, middle, inverse

# 6. Self-supervised learning

---

# Self-supervised learning

Self-supervised learning uses .stress[pretext tasks], hence the name **self-supervised**.

For **NLP**, a strategy called **Masked Language Modeling** does the following (see the sketch on the next slide):

- Select **15% of the tokens at random** in a sequence
- Among them, **replace 80% by the `MASK` token**, **10% by a random token**, and leave the **remaining 10% unchanged**
- .stress[Predict the token] hidden behind the `MASK` token

This self-supervised strategy is one of the core ingredients of BERT

- *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding* (https://arxiv.org/abs/1810.04805)

Other strategies involve **sequence order** prediction, among others
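
---

# Masked Language Modeling in code

A sketch of the masking step from the previous slide (the vocabulary size, special-token ids and the `-100` ignore-label convention are illustrative choices):

```python
import torch

VOCAB_SIZE, MASK_ID = 1000, 1
tokens = torch.randint(2, VOCAB_SIZE, (128,))   # a sequence of token ids
inputs, labels = tokens.clone(), tokens.clone()

selected = torch.rand(tokens.shape) < 0.15      # select 15% of the positions
labels[~selected] = -100                        # loss computed on selected only

r = torch.rand(tokens.shape)
mask_80 = selected & (r < 0.8)                  # 80% of selected: MASK token
rand_10 = selected & (r >= 0.8) & (r < 0.9)     # 10% of selected: random token
inputs[mask_80] = MASK_ID
inputs[rand_10] = torch.randint(2, VOCAB_SIZE, tokens.shape)[rand_10]
# the remaining 10% of selected positions are left unchanged
```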

---

# Self-supervised learning

Recent .stress[very impressive results] in **computer vision**

.center[
]

.center[.tiny[From https://arxiv.org/abs/2006.07733]]

Let's explain *A Simple Framework for Contrastive Learning of Visual Representations* (https://arxiv.org/abs/2002.05709)

---

# Self-supervised learning

The **main ingredients** for .stress[self-supervised learning] (SimCLR version)

- A **stochastic data augmentation** module. It transforms each input `$x_i$` into randomly data-augmented versions `$\tilde x_i$` and `$\tilde x_j$`. The pair `$(\tilde x_i, \tilde x_j)$` is called a **positive pair**.

.center[
]

.center[.tiny[From https://arxiv.org/abs/2002.05709]]

---

# Self-supervised learning

The **main ingredients** for .stress[self-supervised learning] (SimCLR version)

- An **encoder** `$f(\cdot)$` (for instance a ResNet50) that we want to train. With it we compute `$h_i = f(\tilde x_i)$` and `$h_j = f(\tilde x_j)$`
- A **projection head** `$g(\cdot)$` given by a simple feed-forward network, such as a 1-hidden-layer network

`$$
z_i = g(h_i) = \mathbf W^{(2)} \, \text{ReLU}( \mathbf W^{(1)} h_i)
$$`

- Create data-augmentation pairs `$\{ \tilde x_k \}_{k=1, \ldots, 2N}$` from a mini-batch of size `$N$`. On a positive pair `$(i, j)$` we compute the **contrastive loss**

`$$
\ell(i, j) = -\log \left( \frac{e^{\text{sim}(z_i, z_j) / \tau}}{\sum_{k=1}^{2N} \mathbf 1_{k \neq i} e^{\text{sim}(z_i, z_k) / \tau}} \right)
$$`

where `$\text{sim}(u, v) = u^\top v / \| u \| \| v \|$` is the cosine similarity

---

# Self-supervised learning

The **main ingredients** for .stress[self-supervised learning] (SimCLR version)

- The loss on the data-augmented mini-batch `$\{ \tilde x_k \}_{k=1, \ldots, 2N}$` is given by `$ \frac{1}{2N} \sum_{k=1}^N \left( \ell(2 k - 1, 2 k) + \ell(2 k, 2 k - 1) \right) $` (a code sketch follows after the figure)

.center[
]

.center[.tiny[From https://arxiv.org/abs/2002.05709]]
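
---

# Contrastive loss in code

A sketch of the SimCLR loss above, assuming the `$2N$` projections are stacked so that rows `$2k$` and `$2k+1$` form a positive pair (my layout convention, zero-based indexing):

```python
import torch
import torch.nn.functional as F

def nt_xent(z, tau=0.5):
    """Normalized temperature-scaled cross-entropy loss."""
    z = F.normalize(z, dim=1)               # cosine similarity via dot products
    sim = z @ z.T / tau                     # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float("-inf"))       # exclude k = i from the sum
    target = torch.arange(z.shape[0]) ^ 1   # positive of 2k is 2k + 1, and conversely
    # cross-entropy on each row is exactly -log of the softmax ratio above,
    # and the mean over the 2N rows gives the mini-batch loss
    return F.cross_entropy(sim, target)

z = torch.randn(8, 128)                     # 2N = 8 projections z_k
loss = nt_xent(z)
```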

---

# Self-supervised learning

The **main ingredients** for .stress[self-supervised learning] (SimCLR version)

.center[
]

.center[.tiny[From https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html]]

---

# Self-supervised learning

- This version of self-supervised learning requires .stress[large mini-batches]
- So that **enough negatives** are used in the contrastive loss
- .stress[Strong computational] and .stress[memory] footprint

A convincing **alternative approach** is:

- *Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning* (https://arxiv.org/abs/2006.07733)

Some largely **open topics** remain:

- **Self-supervised learning** for .stress[electronic health records]?
- Data-augmentation strategies **beyond computer vision**?
- What about .stress[structured time series]?

---

class: center, middle, inverse

# Thank You!