class: middle, center, title-slide # Deep Learning for Electronic Health Records ## DU IA appliqué à la santé
**Stéphane Gaïffas**
[https://stephanegaiffas.github.io](https://stephanegaiffas.github.io)
.grid[ .kol-1-4[.width-55[![](figures/up.svg)]] .kol-1-4[.width-100[![](figures/lpsm.svg)]] .kol-1-4[.width-45[![](figures/ens.svg)]] .kol-1-4[.width-100[![](figures/prairie.svg)]] ] --- class: end-slide, center, middle # 1. Deep learning in healthcare --- # Deep learning in healthcare Is mostly about **computer vision** and **natural language processing** .center[
] --- # Deep learning in healthcare ## Computer vision .center[
] .footnote[From: "Dermatologist-level classification of skin cancer with deep neural networks" (https://www.nature.com/articles/nature21056.epdf)] --- # Deep learning in healthcare ## Computer vision .center[
] .footnote[From: "Dermatologist-level classification of skin cancer with deep neural networks" (https://www.nature.com/articles/nature21056.epdf)] --- # Deep learning in healthcare ## Computer vision .center[
] .footnote[From: "Dermatologist-level classification of skin cancer with deep neural networks" (https://www.nature.com/articles/nature21056.epdf)] --- # Deep learning in healthcare ## Computer vision .center[
] .footnote[From: "Clinically applicable deep learning for diagnosis and referral in retinal disease" (https://www.nature.com/articles/s41591-018-0107-6?sf195443527=1)] --- # Deep learning in healthcare ## Computer vision .center[
] .footnote[From: "Clinically applicable deep learning for diagnosis and referral in retinal disease" (https://www.nature.com/articles/s41591-018-0107-6?sf195443527=1)] --- # Deep learning in healthcare ## Natural Language Processing To cite but a **few**... - Chen L. et al, *Clinical trial cohort selection based on multi-level rule-based natural language processing system.*, J Am Med Inform Assoc. 2019 - Wang Y. et al, *Clinical information extraction applications: a literature review.* J Biomed Inform. 2018; 77: 34-49" - Sohn S., *Ascertainment of asthma prognosis using natural language processing from electronic medical records*. J Allergy Clin Immunol. 2018 - Sohn S. et al., *Analysis of clinical variations in asthma care documented in electronic health records between staff and resident physicians*. Stud Health Technol Inform. 2017 - Liang H. et al., *Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence*, Nat Med. 2019 --- class: end-slide, center, middle # 2. Beyond CV and NLP for healthcare --- # Deep learning for EHRs EHRs = Electronic Healthcare Records
.center[
] .footnote[From: "Benchmarking deep learning models on large healthcare datasets" (https://www.sciencedirect.com/science/article/pii/S1532046418300716)] --- # Deep learning for EHRs What is the **state-of-the-art** ? .center[
] .footnote[From: "Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data" (https://arxiv.org/pdf/2010.01149.pdf)] --- # Deep learning for EHRs What is the **state-of-the-art** ? .center[
] .footnote[From: "Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data" (https://arxiv.org/pdf/2010.01149.pdf)] --- # Deep learning for EHRs What is the **state-of-the-art** ? .center[
]
.footnote[From: "Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data" (https://arxiv.org/pdf/2010.01149.pdf)] --- class: end-slide, middle, center # 3. Self-advertisements --- # The X-CNAM project - CNAM = **Caisse Primaire d'Assurance Maladie** - With E. Bacry (CNRS, Univ. Paris Dauphine) - A partnership between X and CNAM - Data from **carte vitale** (almost all French citizens, many TeraBytes) .center[
] --- # The X-CNAM project - CNAM = **Caisse Nationale d'Assurance Maladie** - With E. Bacry (CNRS, Univ. Paris Dauphine) - A partnership between X (École polytechnique) and CNAM - Data from **carte vitale** (almost all French citizens, many terabytes) ### Papers - **SCALPEL3: a scalable open-source library for healthcare claims databases**, *International Journal of Medical Informatics*, 2020 - **Screening anxiolytics, hypnotics, antidepressants and neuroleptics for bone fracture risk among elderly: a nation-wide dynamic multivariate self-control study using the SNDS claims database**, *submitted to Drug Safety* - **ConvSCCS: convolutional self-controlled case series model for lagged adverse event detection**, *Biostatistics*, 2019 --- # Prairie 3IA Chair "PERHAPS" - **PERHAPS** = deeP learning for ElectRonic HeAlth records and applications in ProStatic pathologies - **Syndrome Métabolique et Pathologies Prostatiques** (with urologists from Tenon Hospital) .center[
] --- # Prairie 3IA Chair "PERHAPS" - **PERHAPS** = deeP learning for ElectRonic HeAlth records and applications in ProStatic pathologies - **Syndrome Métabolique et Pathologies Prostatiques** (with urologists from Tenon Hospital) ### Papers - **ZiMM: a deep learning model for long term adverse events with non-clinical claims data**, *NeurIPS 2019, ML4H workshop* and *Journal of Biomedical Informatics*, 2020 - **Which attention model and unsupervised pretraining strategy for electronic health records?**, *submitted* - **About contrastive unsupervised representation learning for classification and its convergence**, *submitted* --- # General-purpose DL techniques We'll describe **recent DL techniques** that can be **very useful** for EHRs **Architectures** involving **attention mechanisms** - *Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention* (https://arxiv.org/abs/2006.16236) - *Rethinking Attention with Performers* (https://arxiv.org/abs/2009.14794) **Self-supervised learning** based on **contrastive learning** - *A Simple Framework for Contrastive Learning of Visual Representations* (https://arxiv.org/abs/2002.05709) - *Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning* (https://arxiv.org/abs/2006.07733) --- class: end-slide, center, middle # 4. Attention-based architectures --- # From RNNs to transformers - RNNs used to be the **state of the art** for **machine translation**, **time series analysis**, and more generally any **sequence-to-sequence task** - Then, **attention** was used **inside recurrent layers** to improve how they handle **long-range dependencies** - But RNNs are **hard to scale**: their **recurrent** nature **hinders parallel computation** A **game changer** came in 2017: - **Attention is all you need** by Vaswani et al. (2017) introduces the **transformer** architecture (https://arxiv.org/abs/1706.03762) - Many follow-ups since then... - The core ingredient is the **Multi-Head Self-Attention** layer This led to models such as - BERT, Transformer-XL, GPT-3 (175 billion parameters!) --- # GPT-3 examples
.center[
] .center[ .small[https://app.inferkit.com/demo] ] --- # Self-attention layer - For the first layer, **input** is a sequence of **token embeddings** $$\mathbf X = [x\_1, \ldots, x\_L]$$ where $x\_i \in \mathbb R^d$ - **Output** is a same-length sequence of (hopefully) **contextualized embeddings** - For other layers, **input** is a sequence of **contextualized embedding vectors** (output of a previous self-attention layer) The next figures are from - http://jalammar.github.io/illustrated-transformer --- # Self-attention layer It first computes **keys**, **queries** and **values**: $$ \mathbf{Q} = \mathbf{X W}^Q, \quad \mathbf{K} = \mathbf{X W}^K, \quad \mathbf{V} = \mathbf{X W}^V $$ where $$ \mathbf{W}^Q \in \mathbb R^{d \times d\_k}, \quad \mathbf{W}^K \in \mathbb R^{d \times d\_k} \quad \text{and} \quad \mathbf{W}^V \in \mathbb R^{d \times d\_v} $$ are learned parameters and $\mathbf{X} \in \mathbb R^{L \times d}$ is the input .center[
] --- # Self-attention layer It then computes **inner products** between **queries** and **keys** and applies a **softmax over rows** $$ \mathbf{Z} = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d\_k}} \right) \mathbf{V} $$ .center[
] --- # Multi-head self-attention layer Combines $H$ **heads** of **self-attention** $$ \mathbf{Q}\_h = \mathbf{X W}\_h^Q, \quad \mathbf{K}\_h = \mathbf{X W}\_h^K, \quad \mathbf{V}\_h = \mathbf{X W}\_h^V $$ $$ \mathbf{Z}\_h = \text{softmax}\left( \frac{\mathbf{Q}\_h \mathbf{K}\_h^\top}{\sqrt{d\_k}} \right) \mathbf{V}\_h $$ $$ \text{MSA}(\mathbf{X}) = [\mathbf{Z}\_1 \; \cdots \; \mathbf{Z}\_H] \; \mathbf{W}^O $$ .center[
] --- # Multi-head self-attention layer **Visualization** of the **softmax matrix** .center[
] .center[
] This solves, among many other things, **coreference resolution** (a difficult problem in machine translation) .footnote[From https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html] --- # Transformer architecture The **encoder** stacks several MSA layers as follows: $$ \mathbf{Y}\_k = \text{LayerNorm}(\mathbf{X}\_k + \text{MSA}(\mathbf{X}\_k)) $$ $$ \mathbf{X}\_{k+1} = \text{LayerNorm}(\mathbf{Y}\_k + \text{FF}(\mathbf{Y}\_k)) $$ where - $\text{FF}$ is a **dense** or **feed-forward** network - $\mathbf{X}\_1 = \mathbf{e}$ is the sequence of $L$ token embeddings - $\mathbf{X}\_k \in \mathbb R^{L \times d\_k}$ is the input of the $k$-th layer - $\mathbf{X}\_{k+1} \in \mathbb R^{L \times d\_k}$ is the output of the $k$-th layer --- # Transformer architecture **Layer normalization** versus **batch normalization** .center[
] --- # Transformer architecture Usually uses an **encoder / decoder architecture** .center[
] --- # Transformer architecture Usually uses an **encoder / decoder architecture** .center[
] .footnote[From https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html] --- # Positional embeddings - As such, **token embeddings** do not change with their **position** in the sequence - A strategy is **positional embeddings**: either **deterministic** or **trained** - Just add each **positional embedding** to each **token embedding** before feeding the tensor into the architecture - The original implementation uses **512 cosines and sines** - An example with only 2 cosines and sines: .center[
] --- # Quadratic complexity of MSA - The MSA layer has **memory and computational complexity** $O(L^2 d)$ (see the sketch below) - It demands huge computational power and saturates GPU memory for **long sequences** (large $L$) - Some **follow-up works** propose strategies to address this
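Below is a minimal NumPy sketch of the MSA layer defined on the previous slides (function names and toy dimensions are illustrative, not taken from any particular library). It materializes an $L \times L$ score matrix per head, which is exactly where the $O(L^2 d)$ memory and compute cost comes from.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax, stabilized by subtracting the row maximum
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # X: (L, d), Q and K: (L, d_k), V: (L, d_v)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # (L, L) score matrix: the O(L^2) bottleneck
    return A @ V                         # (L, d_v)

def multi_head_self_attention(X, heads, W_O):
    # heads is a list of (W_Q, W_K, W_V) triplets; head outputs are concatenated
    Z = np.concatenate([self_attention(X, *head) for head in heads], axis=-1)
    return Z @ W_O                       # back to (L, d)

# Toy dimensions: sequence length L, embedding size d, H heads of size d_k
rng = np.random.default_rng(0)
L, d, H, d_k = 6, 16, 2, 8
X = rng.normal(size=(L, d))
heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(H)]
W_O = rng.normal(size=(H * d_k, d))
print(multi_head_self_attention(X, heads, W_O).shape)  # (6, 16)
```

The follow-up works listed on the next slide essentially avoid materializing this $L \times L$ matrix, either by restricting attention to a neighborhood (graph attention) or by linearizing the softmax (linear transformers, Performers).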
.center[
] --- # Quadratic complexity of MSA - The MSA layer has **memory and computational complexity** $O(L^2 d)$ - It demands huge computational power and saturates GPU memory for **long sequences** (large $L$) - Some **follow-up works** propose strategies to address this **Graph Neural Networks** and **Graph Attention Networks** - *Graph Attention Networks* (https://arxiv.org/abs/1710.10903) - *Graph Neural Networks: A Review of Methods and Applications* (https://arxiv.org/abs/1812.08434) **Linear transformers** - *Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention* (https://arxiv.org/abs/2006.16236) - *Rethinking Attention with Performers* (https://arxiv.org/abs/2009.14794) --- class: end-slide, center, middle # 5. Application to healthcare pathways --- # Healthcare pathways .center[
] --- # Healthcare pathways **Encoder-only** architectures .center[
] --- # Healthcare pathways This allows us to **improve on strong baselines** .center[
] .footnote[From: "ZiMM: a deep learning model for long term adverse events with non-clinical claims data"] --- class: end-slide, center, middle # 6. Self-supervised learning --- # Self-supervised learning Use **pretext tasks** hence the name **self-supervised**. For **NLP** a strategy called **Masked Language Modeling** does the following: - Selects **15% of the tokens at random** in a sequence - among them, **replace 80% by the `MASK` token**, **10% by a random code** and leave the **remaining 10% unchanged** - **Predict the token** hidden behind the `MASK` token One of the core ingredient of BERT - *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding* (https://arxiv.org/abs/1810.04805) Other strategies involve **sequence order** prediction, among others --- # Self-supervised learning Recent **very impressive results** in **computer vision** .center[
] Let's explain "A Simple Framework for Contrastive Learning of Visual Representations" (https://arxiv.org/abs/2002.05709) .footnote[From https://arxiv.org/abs/2006.07733] --- # Self-supervised learning The **main ingredients** for **self-supervised learning** (SimCLR version) - A **stochastic data augmentation** module. It transforms each input $x$ into two randomly augmented versions $\tilde x\_i$ and $\tilde x\_j$. The pair $(\tilde x\_i, \tilde x\_j)$ is called a **positive pair**. .center[
] .footnote[From https://arxiv.org/abs/2002.05709] --- # Self-supervised learning The **main ingredients** for **self-supervised learning** (SimCLR version) - An **encoder** $f(\cdot)$ (for instance a ResNet50) that we want to train. We use it to compute $h\_i = f(\tilde x\_i)$ and $h\_j = f(\tilde x\_j)$ - A **projection head** $g(\cdot)$ given by a simple feed-forward network, such as a one-hidden-layer network $$ z\_i = g(h\_i) = \mathbf W^{(2)} \; \text{ReLU}( \mathbf W^{(1)} \; h\_i) $$ - Create **data-augmented pairs** $\\{ \tilde x\_k \\}\_{k=1, \ldots, 2N}$ from a mini-batch of size $N$. On a positive pair $(i, j)$ we compute the **contrastive loss** $$ \ell(i, j) = -\log \left( \frac{e^{\text{sim}(z\_i, z\_j) / \tau}}{\sum\_{k=1}^{2N} \mathbf 1\_{k \neq i} e^{\text{sim}(z\_i, z\_k) / \tau}} \right) $$ where $\text{sim}(u, v) = u^\top v / \\| u \\| \\| v \\|$ is the cosine similarity --- # Self-supervised learning The **main ingredients** for **self-supervised learning** (SimCLR version) - The loss on the data-augmented mini-batch $\\{ \tilde x\_k \\}\_{k=1, \ldots, 2N}$ is given by $ \frac{1}{2N} \sum\_{k=1}^N \left( \ell(2 k - 1, 2 k) + \ell(2 k, 2 k - 1) \right) $ (see the sketch below) .center[
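For concreteness, here is a small NumPy sketch of this mini-batch loss (the function name, the ordering convention and the toy dimensions are assumptions made here, not taken from the paper); it assumes the two augmented views of example $k$ are stored in consecutive rows of `Z`.

```python
import numpy as np

def nt_xent_loss(Z, tau=0.5):
    # Z: (2N, p) projections z_k; rows (2k, 2k+1) are assumed to form a positive pair
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # cosine similarity = dot product of unit vectors
    S = Z @ Z.T / tau                                 # (2N, 2N) matrix of sim(z_i, z_k) / tau
    np.fill_diagonal(S, -np.inf)                      # the indicator 1_{k != i}: drop self-similarity
    log_den = np.log(np.exp(S).sum(axis=1))           # log of the denominator for each anchor i
    N = Z.shape[0] // 2
    loss = 0.0
    for k in range(N):
        i, j = 2 * k, 2 * k + 1
        loss += (log_den[i] - S[i, j]) + (log_den[j] - S[j, i])  # l(i, j) + l(j, i)
    return loss / (2 * N)

# Toy usage: 2N = 8 projected views of dimension 32
rng = np.random.default_rng(0)
print(nt_xent_loss(rng.normal(size=(8, 32))))
```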
] .footnote[From https://arxiv.org/abs/2002.05709] --- # Self-supervised learning .center[
] .footnote[From https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html] --- # Self-supervised learning - This version of self-supervised learning requires **large mini-batches** - So that **enough negatives** are used in the contrastive loss - This implies a **large computational** and **memory** footprint A convincing **alternative approach** is: - *Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning* (https://arxiv.org/abs/2006.07733) --- # Conclusion Some **mostly open** topics in deep learning applied to EHRs: - **Self-supervised learning** for EHRs? - In particular, how to perform **data augmentation**? - The impact of the **sparsity** and **irregularity** of healthcare pathways on general-purpose models (RNNs, transformers with positional encoding) --- class: end-slide, center, middle # Thank you!