class: center, middle

# Deep Learning for Electronic Health Records

## Deep Learning in finance workshop - LPSM

.medium[Stéphane Gaïffas]

.medium[(with E. Bacry, I. Merad, M. Morel, A. Nitavskyi, Y. Yu)]
.center[
]

---

# Disclaimer
.large[.center[No finance in this talk!]]
.center[
]

---

class: center, middle, inverse

# 1. Deep learning in healthcare

---

# Deep learning in healthcare

Is **mostly** about .stress[computer vision], .stress[natural language processing] or .stress[omics]

.center[
] --- # Deep learning in healthcare: CV .center[
] .center[.tiny[From: "Dermatologist-level classification of skin cancer with deep neural networks" (https://www.nature.com/articles/nature21056.epdf)]] --- # Deep learning in healthcare: CV .center[
] .center[.tiny[From: "Dermatologist-level classification of skin cancer with deep neural networks" (https://www.nature.com/articles/nature21056.epdf)]] --- # Deep learning in healthcare: CV .center[
] .center[.tiny[From: "Dermatologist-level classification of skin cancer with deep neural networks" (https://www.nature.com/articles/nature21056.epdf)]] --- # Deep learning in healthcare: CV .center[
] .center[.tiny[From: "Clinically applicable deep learning for diagnosis and referral in retinal disease" (https://www.nature.com/articles/s41591-018-0107-6?sf195443527=1)]] --- # Deep learning in healthcare: CV .center[
] .center[.tiny[From: "Clinically applicable deep learning for diagnosis and referral in retinal disease" (https://www.nature.com/articles/s41591-018-0107-6?sf195443527=1)]] --- # Deep learning in healthcare: NLP To cite but a **few**... - Chen L. et al, *Clinical trial cohort selection based on multi-level rule-based natural language processing system.*, J Am Med Inform Assoc. 2019 - Wang Y. et al, *Clinical information extraction applications: a literature review.* J Biomed Inform. 2018; 77: 34-49" - Sohn S., *Ascertainment of asthma prognosis using natural language processing from electronic medical records*. J Allergy Clin Immunol. 2018 - Sohn S. et al., *Analysis of clinical variations in asthma care documented in electronic health records between staff and resident physicians*. Stud Health Technol Inform. 2017 - Liang H. et al., *Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence*, Nat Med. 2019 --- class: center, middle, inverse # 2. DL in healthcare: beyond CV and NLP --- # Deep learning for EHRs **Beyond** .stress[computer vision] and .stress[natural language processing] ?
.center[
] .center[.tiny[From: "Benchmarking deep learning models on large healthcare datasets" (https://www.sciencedirect.com/science/article/pii/S1532046418300716)]] --- # Deep learning for EHRs What is the .stress[state-of-the-art] ? .center[
] .center[.tiny[From: "Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data" (https://arxiv.org/pdf/2010.01149.pdf)]] --- # Deep learning for EHRs What is the .stress[state-of-the-art] ? .center[
] .center[.tiny[From: "Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data" (https://arxiv.org/pdf/2010.01149.pdf)]] --- # Deep learning for EHRs What is the .stress[state-of-the-art] ? .center[
]
.center[.tiny[From: "Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data" (https://arxiv.org/pdf/2010.01149.pdf)]] --- class: center, middle, inverse # 3. Self-advertisement ## Some of my projects --- # The X-CNAM project - CNAM = .stress[Caisse Primaire d'Assurance Maladie] - With E. Bacry (CNRS, Univ. Paris Dauphine) - A partnership between X and CNAM, ends this year (2017-2020) - Data from **carte vitale** (almost all French citizens, many TeraBytes) .center[
]

---

# The X-CNAM project

- CNAM = .stress[Caisse Nationale de l'Assurance Maladie]
- With E. Bacry (CNRS, Univ. Paris Dauphine)
- A partnership between X and CNAM, ends this year (2017-2020)
- Data from the **carte vitale** (almost all French citizens, many terabytes)

### Papers

- **SCALPEL3: a scalable open-source library for healthcare claims databases.** *International Journal of Medical Informatics*, 2020
- **Screening anxiolytics, hypnotics, antidepressants and neuroleptics for bone fracture risk among elderly: a nation-wide dynamic multivariate self-control study using the SNDS claims database**, *submitted to Drugs Safety*, 2020
- **ConvSCCS: convolutional self-controlled case series model for lagged adverse event detection**, *Biostatistics*, 2019

---

# Prairie 3IA Chair "PERHAPS"

- **PERHAPS** = deeP learning for ElectRonic HeAlth records and applications in ProStatic pathologies
- **Syndrome Métabolique et Pathologies Prostatiques** (Metabolic Syndrome and Prostatic Pathologies, with urologists from Tenon Hospital)

.center[
]

---

# Prairie 3IA Chair "PERHAPS"

- **PERHAPS** = deeP learning for ElectRonic HeAlth records and applications in ProStatic pathologies
- **Syndrome Métabolique et Pathologies Prostatiques** (Metabolic Syndrome and Prostatic Pathologies, with urologists from Tenon Hospital)

### Papers

- **ZiMM: a deep learning model for long term adverse events with non-clinical claims data**, *NeurIPS 2019, ML4H workshop* and *Journal of Biomedical Informatics*, 2020
- **Which attention model and unsupervised pretraining strategy for electronic health records?** *submitted*
- **About contrastive unsupervised representation learning for classification and its convergence.** *submitted*

---

# Oscour

- Partnership with E. Bacry and **emergency doctors** from AP-HP
- **Oscour database**: all arrivals at emergency services in Ile-de-France
- Aim: .stress[predict future hospitalizations]
- We are still waiting for **recent data** (including COVID-19...)
- And waiting for data from the **whole of France**
- An ongoing work

.center[
]

---

# ANR LabCOM with Califrais

- .stress[ANR LabCOM] ("laboratoire commun", a joint academia-industry lab) obtained one month ago
- **LOPF project** (Large-scale Optimization of Product Flows)
- With the company **Califrais**

### Topics of research

- Large-scale modeling of price evolution of products
- Multi-site stock optimization through reinforcement learning
- Churn modeling and prediction using online methods
- Recommender systems for large-scale products/clients graph data

.center[
]

---

# Some general-purpose DL techniques

In this talk I will describe .stress[very recent DL techniques] that can be **very useful beyond vision / NLP / healthcare applications**

**Architectures** involving .stress[attention mechanisms]

- *Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention* (https://arxiv.org/abs/2006.16236)
- *Rethinking Attention with Performers* (https://arxiv.org/abs/2009.14794)

**Self-supervised learning** based on .stress[contrastive learning]

- *A Simple Framework for Contrastive Learning of Visual Representations* (https://arxiv.org/abs/2002.05709)
- *Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning* (https://arxiv.org/abs/2006.07733)

`$$
\newcommand{\R}{\mathbb R}
\newcommand{\cY}{\mathcal Y}
\newcommand{\cJ}{\mathcal J}
\newcommand{\bX}{\boldsymbol X}
\newcommand{\XB}{\boldsymbol X^B}
\newcommand{\TV}{\text{TV}}
\newcommand{\norm}[1]{\| #1 \|}
\newcommand{\inr}[1]{\langle #1 \rangle}
\newcommand{\ind}[1]{\mathbf 1_{#1}}
\DeclareMathOperator{\argmin}{argmin}
\DeclareMathOperator{\pen}{pen}
\DeclareMathOperator{\prox}{prox}
\DeclareMathOperator{\bina}{bina}
$$`

---

class: center, middle, inverse

# 4. Attention-based architectures

---

# From RNNs to transformers

- RNNs used to be the .stress[state of the art] for **machine translation**, **time series analysis**, and more generally any **sequence-to-sequence task**
- Then, **attention** was used .stress[inside recurrent layers] to improve how they capture **long-range dependencies**
- But RNNs are **hard to scale**: their **recurrent** nature .stress[hinders distributed computations]

A .stress[game changer] came in 2017:

- **Attention is all you need** by Vaswani et al. (2017) introduces the .stress[transformer] architecture (https://arxiv.org/abs/1706.03762)
- Many follow-ups since then...
- The core ingredient is the .stress[Multi-Head Self-Attention] layer

Led to things like

- BERT, Transformer-XL, GPT-3 (175 billion parameters!)

---

# GPT-3 examples
.center[ .small[https://app.inferkit.com/demo]
]

---

# Self-attention layer

- For the first layer, the **input** is a sequence of .stress[token embeddings]
$$\mathbf{X} = [x_1, \dots , x_L]$$
where $x_i \in \R^d$
- The **output** is a same-length sequence of (hopefully) .stress[contextualized embeddings]
- For the other layers, the **input** is a sequence of **contextualized embedding vectors** (the output of a previous self-attention layer)

The next displays are from

- http://jalammar.github.io/illustrated-transformer

---

# Self-attention layer

It first computes .stress[keys], .stress[queries] and .stress[values]:

`$$
\mathbf{Q} = \mathbf{X W}^Q, \quad \mathbf{K} = \mathbf{X W}^K, \quad \mathbf{V} = \mathbf{X W}^V
$$`

where

`$$
\mathbf{W}^Q \in \R^{d \times d_k}, \quad \mathbf{W}^K \in \R^{d \times d_k} \quad \text{and} \quad \mathbf{W}^V \in \R^{d \times d_v}
$$`

are learned parameters and $\mathbf{X} \in \R^{L \times d}$ is the input

.center[
]

---

# Self-attention layer

It then computes .stress[inner products] between **queries** and **keys** and applies a .stress[softmax over rows]:

`$$
\mathbf{Z} = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d_k}} \right) \mathbf{V}
$$`

.center[
]
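
---

# Self-attention in code

A minimal PyTorch sketch of the two previous slides; shapes and names are illustrative, not taken from a reference implementation:

```python
import math

import torch

L, d, d_k, d_v = 10, 64, 32, 32            # sequence length and dimensions
X = torch.randn(L, d)                      # input token embeddings

# Learned projections (random here, trained in practice)
W_Q, W_K, W_V = torch.randn(d, d_k), torch.randn(d, d_k), torch.randn(d, d_v)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # queries, keys, values

# Scaled dot products between queries and keys, softmax over rows
A = torch.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)   # (L, L) attention matrix
Z = A @ V                                  # (L, d_v) contextualized embeddings
```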

---

# Multi-head self-attention layer

Combines $H$ .stress[heads] of **self-attention**

`$$
\begin{align*}
\mathbf{Q}_h &= \mathbf{X W}_h^Q, \quad \mathbf{K}_h = \mathbf{X W}_h^K, \quad \mathbf{V}_h = \mathbf{X W}_h^V \\
\mathbf{Z}_h &= \text{softmax}\left( \frac{\mathbf{Q}_h \mathbf{K}_h^\top}{\sqrt{d_k}} \right) \mathbf{V}_h \\
\text{MSA}(\mathbf{X}) &= [\mathbf{Z}_1 \; \cdots \; \mathbf{Z}_H] \; \mathbf{W}^O
\end{align*}
$$`

.center[
]
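
---

# Multi-head self-attention in code

Continuing the sketch, one straightforward (illustrative, loop-based) way to combine $H$ heads:

```python
import math

import torch

H, L, d, d_k, d_v = 4, 10, 64, 16, 16
X = torch.randn(L, d)
W_Q, W_K = torch.randn(H, d, d_k), torch.randn(H, d, d_k)  # one projection per head
W_V = torch.randn(H, d, d_v)
W_O = torch.randn(H * d_v, d)              # output projection

Z_heads = []
for h in range(H):
    Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]
    A = torch.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)
    Z_heads.append(A @ V)                  # Z_h, of shape (L, d_v)

MSA = torch.cat(Z_heads, dim=-1) @ W_O     # concatenate heads, project to (L, d)
```

Real implementations batch the heads into a single tensor operation instead of a Python loop.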

---

# Multi-head self-attention layer

**Visualization** of the .stress[softmax matrix]

.center[
] .center[
]

.center[.tiny[From https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html]]

Solves, among many other things, **coreference resolution** (a difficult problem in machine translation)

---

# Transformer architecture

The .stress[encoder] stacks several MSA layers as follows:

`$$
\begin{align*}
\mathbf{Y}_k &= \text{LayerNorm}(\mathbf{X}_k + \text{MSA}(\mathbf{X}_k)) \\
\mathbf{X}_{k+1} &= \text{LayerNorm}(\mathbf{Y}_k + \text{FF}(\mathbf{Y}_k))
\end{align*}
$$`

where

- $\text{FF}$ is a **dense layer** (feed-forward)
- $\mathbf{X}_1 = \mathbf{e}$, the sequence of $L$ token embeddings
- $\mathbf{X}_k \in \R^{L \times d_k}$ is the input of the $k$-th layer
- $\mathbf{X}_{k+1} \in \R^{L \times d_k}$ is the output of the $k$-th layer

(a code sketch follows after the next slide)

---

# Transformer architecture

.stress[Layer normalization] versus **batch normalization**

.center[
]
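
---

# Encoder block in code

A sketch of the encoder equations two slides back using PyTorch modules (simplified: real implementations also add dropout):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, X):
        # Y_k = LayerNorm(X_k + MSA(X_k))
        Y = self.norm1(X + self.msa(X, X, X, need_weights=False)[0])
        # X_{k+1} = LayerNorm(Y_k + FF(Y_k))
        return self.norm2(Y + self.ff(Y))

X = torch.randn(8, 10, 64)                 # (batch, L, d_model)
out = EncoderBlock()(X)                    # same shape as the input
```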

---

# Transformer architecture

Usually uses an .stress[encoder / decoder architecture]

.center[
]

---

# Transformer architecture

Usually uses an .stress[encoder / decoder architecture]

.center[
]

.center[.tiny[From https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html]]

---

# Positional embeddings

- As such, **token embeddings** do not change with their .stress[position] in the sequence
- A remedy is .stress[positional embeddings]: either **deterministic** or **trained**
- Just add each **positional embedding** to its **token embedding** before pushing the tensor through the architecture
- The original implementation uses **512-dimensional** embeddings built from **cosines and sines** (a sketch follows on the next slide)
- Example with only 2 cosines and sines:

.center[
]
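
---

# Positional embeddings in code

A sketch of deterministic sinusoidal positional embeddings in the spirit of the original implementation (dimensions are illustrative):

```python
import numpy as np

def positional_embeddings(L, d):
    """Sines on even dimensions, cosines on odd ones."""
    pos = np.arange(L)[:, None]            # positions 0, ..., L - 1
    i = np.arange(d // 2)[None, :]         # frequency indices
    angles = pos / 10000 ** (2 * i / d)    # (L, d / 2)
    P = np.zeros((L, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

# Added to the token embeddings before the first layer
X = np.random.randn(10, 512)               # L = 10 token embeddings
X = X + positional_embeddings(10, 512)
```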

---

# Problem: quadratic complexity of MSA

- The MSA layer has .stress[memory and computational complexity] $O(L^2 d)$
- Huge demand on computational power; it saturates GPU memory for **long sequences** ($L$ large)
- Some **follow-up works** propose strategies to solve this
.center[
]

---

# Problem: quadratic complexity of MSA

- The MSA layer has .stress[memory and computational complexity] $O(L^2 d)$
- Huge demand on computational power; it saturates GPU memory for **long sequences** ($L$ large)
- Some **follow-up works** propose strategies to solve this

**Graph Neural Networks** and **Graph Attention Networks**

- *Graph Attention Networks* (https://arxiv.org/abs/1710.10903)
- *Graph Neural Networks: A Review of Methods and Applications* (https://arxiv.org/abs/1812.08434)

--

**Linear transformers**

- *Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention* (https://arxiv.org/abs/2006.16236)
- *Rethinking Attention with Performers* (https://arxiv.org/abs/2009.14794)

---

# Problem: quadratic complexity of MSA

The **bottleneck** is the computation of the .stress[softmax attention]

`$$
\mathbf{Z} = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d}} \right) \mathbf{V}
$$`

that we can rewrite more generally as

`$$
Z_i = \frac{\sum_{j=1}^L \text{sim}(Q_i, K_j) V_j}{\sum_{j=1}^L \text{sim}(Q_i, K_j)}
$$`

for `$i=1, \ldots, L$` where

`$$
\text{sim}(q, k) = \exp\left( \frac{q^\top k}{\sqrt d} \right)
$$`

---

# Linear transformers

- https://arxiv.org/abs/2006.16236 uses a .stress[kernel trick]
- Replace `$\text{sim}(q, k)$` by

`$$
\text{sim}(q, k) = \phi(q)^\top \phi(k)
$$`

for a **feature mapping** $\phi$

- And consider in practice just a **simple activation function**

`$$
\phi(z) = \text{elu}(z) + 1
$$`

where `$\text{elu}(z) = z$` if `$z > 0$` and `$\text{elu}(z) = \alpha (e^z - 1)$` if `$z \leq 0$`

- **Solves** the .stress[memory] and .stress[computational bottlenecks] because of

`$$
Z_i = \frac{\sum_{j=1}^L \phi(Q_i)^\top \phi(K_j) V_j}{\sum_{j=1}^L \phi(Q_i)^\top \phi(K_j)} = \frac{\phi(Q_i)^\top \sum_{j=1}^L \phi(K_j) V_j}{\phi(Q_i)^\top \sum_{j=1}^L \phi(K_j)}
$$`

- No need to .stress[compute explicitly] the **attention matrix** anymore! (a code sketch follows two slides ahead)

---

# Linear transformers

- https://arxiv.org/abs/2009.14794 uses the same kernel trick
- But uses .stress[random projections] (**positive random features** for softmax)
- The trick relies on the following simple remark:

`$$
\text{sim}(q, k) = e^{q^\top k} = e^{\frac 12 \| q \|^2} K(q, k) e^{\frac 12 \| k \|^2 }
$$`

where `$K(q, k) = e^{- \frac 12 \| q - k \|^2}$` is the .stress[Gaussian kernel] so that

`$$
\begin{align*}
\text{sim}(q, k) &= e^{-\frac 12 (\| q \|^2 + \| k \|^2)} \; \mathbb E_{\omega \sim \mathcal N(0, \mathbf I_d)} \left[ e^{\omega^\top (q + k)} \right] \\
&\approx e^{-\frac 12 (\| q \|^2 + \| k \|^2)} \; \frac 1M \sum_{m=1}^M e^{\omega_m^\top (q + k)}
\end{align*}
$$`

with `$\omega_1, \ldots, \omega_M$` i.i.d. `$\mathcal N(0, \mathbf I_d)$`
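
---

# Linear attention in code

A sketch of the linearized (non-causal) attention with the `$\phi(z) = \text{elu}(z) + 1$` feature map; shapes are illustrative:

```python
import torch
import torch.nn.functional as F

L, d_k, d_v = 1000, 32, 32
Q, K, V = torch.randn(L, d_k), torch.randn(L, d_k), torch.randn(L, d_v)

phi = lambda x: F.elu(x) + 1               # positive feature map

# The sums over j are computed once: O(L) instead of O(L^2)
S = phi(K).T @ V                           # (d_k, d_v) = sum_j phi(K_j) V_j^T
s = phi(K).sum(dim=0)                      # (d_k,)     = sum_j phi(K_j)

Z = (phi(Q) @ S) / (phi(Q) @ s)[:, None]   # (L, d_v), no (L, L) matrix formed
```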

---

# Linear transformers

- https://arxiv.org/abs/2009.14794

.center[
]
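
---

# Positive random features in code

A sketch of the random-feature approximation from the previous slides; in practice the paper also rescales queries and keys and orthogonalizes the projections `$\omega_m$`, which is skipped here:

```python
import torch

def softmax_features(x, omega):
    # phi(x)_m = exp(omega_m^T x - ||x||^2 / 2) / sqrt(M), positive by construction
    M = omega.shape[0]
    return torch.exp(x @ omega.T - (x ** 2).sum(-1, keepdim=True) / 2) / M ** 0.5

L, d, M = 1000, 32, 128
Q, K, V = torch.randn(L, d), torch.randn(L, d), torch.randn(L, d)
omega = torch.randn(M, d)                  # i.i.d. N(0, I_d) projections

phi_Q, phi_K = softmax_features(Q, omega), softmax_features(K, omega)

# Same linear-attention computation as before, with random features
Z = (phi_Q @ (phi_K.T @ V)) / (phi_Q @ phi_K.sum(dim=0))[:, None]
```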

---

class: center, middle, inverse

# 5. Application to health sequences

---

# Application to health sequences

We can use either a .stress[sequence] or a .stress[graph] structure

.center[
]

---

# Application to health sequences

We use .stress[encoder-only] architectures

.center[
]

---

# Application to health sequences

- **ZiMM: a deep learning model for long term adverse events with non-clinical claims data**, *NeurIPS 2019, ML4H workshop* and *Journal of Biomedical Informatics*, 2020

.center[
]

---

class: center, middle, inverse

# 6. Self-supervised learning

---

# Self-supervised learning

Self-supervised learning uses .stress[pretext tasks], hence the name **self-supervised**.

For **NLP**, a strategy called **Masked Language Modeling** does the following (see the sketch on the next slide):

- Select **15% of the tokens at random** in a sequence
- Among them, **replace 80% by the `MASK` token**, **10% by a random token**, and leave the **remaining 10% unchanged**
- .stress[Predict the token] hidden behind the `MASK` token

This self-supervised strategy is one of the core ingredients of BERT

- *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding* (https://arxiv.org/abs/1810.04805)

Other strategies involve **sequence order** prediction, among others
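
---

# Masked Language Modeling in code

A sketch of the masking step from the previous slide (the vocabulary size, special-token ids and the `-100` ignore-label convention are illustrative choices):

```python
import torch

VOCAB_SIZE, MASK_ID = 1000, 1
tokens = torch.randint(2, VOCAB_SIZE, (128,))   # a sequence of token ids
inputs, labels = tokens.clone(), tokens.clone()

selected = torch.rand(tokens.shape) < 0.15      # select 15% of the positions
labels[~selected] = -100                        # loss computed on selected only

r = torch.rand(tokens.shape)
mask_80 = selected & (r < 0.8)                  # 80% of selected: MASK token
rand_10 = selected & (r >= 0.8) & (r < 0.9)     # 10% of selected: random token
inputs[mask_80] = MASK_ID
inputs[rand_10] = torch.randint(2, VOCAB_SIZE, tokens.shape)[rand_10]
# the remaining 10% of selected positions are left unchanged
```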

---

# Self-supervised learning

Recent .stress[very impressive results] in **computer vision**

.center[
]

.center[.tiny[From https://arxiv.org/abs/2006.07733]]

Let's explain *A Simple Framework for Contrastive Learning of Visual Representations* (https://arxiv.org/abs/2002.05709)

---

# Self-supervised learning

The **main ingredients** for .stress[self-supervised learning] (SimCLR version)

- A **stochastic data augmentation** module. It transforms each input `$x_i$` into randomly data-augmented versions `$\tilde x_i$` and `$\tilde x_j$`. The pair `$(\tilde x_i, \tilde x_j)$` is called a **positive pair**.

.center[
]

.center[.tiny[From https://arxiv.org/abs/2002.05709]]

---

# Self-supervised learning

The **main ingredients** for .stress[self-supervised learning] (SimCLR version)

- An **encoder** `$f(\cdot)$` (for instance a ResNet50) that we want to train. With it we compute `$h_i = f(\tilde x_i)$` and `$h_j = f(\tilde x_j)$`
- A **projection head** `$g(\cdot)$` given by a simple feed-forward network, such as a 1-hidden-layer network

`$$
z_i = g(h_i) = \mathbf W^{(2)} \, \text{ReLU}( \mathbf W^{(1)} h_i)
$$`

- Create data-augmentation pairs `$\{ \tilde x_k \}_{k=1, \ldots, 2N}$` from a mini-batch of size `$N$`. On a positive pair `$(i, j)$` we compute the **contrastive loss**

`$$
\ell(i, j) = -\log \left( \frac{e^{\text{sim}(z_i, z_j) / \tau}}{\sum_{k=1}^{2N} \mathbf 1_{k \neq i} e^{\text{sim}(z_i, z_k) / \tau}} \right)
$$`

where `$\text{sim}(u, v) = u^\top v / \| u \| \| v \|$` is the cosine similarity

---

# Self-supervised learning

The **main ingredients** for .stress[self-supervised learning] (SimCLR version)

- The loss on the data-augmented mini-batch `$\{ \tilde x_k \}_{k=1, \ldots, 2N}$` is given by `$ \frac{1}{2N} \sum_{k=1}^N \left( \ell(2 k - 1, 2 k) + \ell(2 k, 2 k - 1) \right) $` (a code sketch follows after the figure)

.center[
]

.center[.tiny[From https://arxiv.org/abs/2002.05709]]
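
---

# Contrastive loss in code

A sketch of the SimCLR loss above, assuming the `$2N$` projections are stacked so that rows `$2k$` and `$2k+1$` form a positive pair (my layout convention, zero-based indexing):

```python
import torch
import torch.nn.functional as F

def nt_xent(z, tau=0.5):
    """Normalized temperature-scaled cross-entropy loss."""
    z = F.normalize(z, dim=1)               # cosine similarity via dot products
    sim = z @ z.T / tau                     # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float("-inf"))       # exclude k = i from the sum
    target = torch.arange(z.shape[0]) ^ 1   # positive of 2k is 2k + 1, and conversely
    # cross-entropy on each row is exactly -log of the softmax ratio above,
    # and the mean over the 2N rows gives the mini-batch loss
    return F.cross_entropy(sim, target)

z = torch.randn(8, 128)                     # 2N = 8 projections z_k
loss = nt_xent(z)
```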

---

# Self-supervised learning

The **main ingredients** for .stress[self-supervised learning] (SimCLR version)

.center[
]

.center[.tiny[From https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html]]

---

# Self-supervised learning

- This version of self-supervised learning requires .stress[large mini-batches]
- So that **enough negatives** are used in the contrastive loss
- .stress[Strong computational] and .stress[memory] footprint

A convincing **alternative approach** is:

- *Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning* (https://arxiv.org/abs/2006.07733)

Some largely **open topics** remain:

- **Self-supervised learning** for .stress[electronic health records]?
- Data-augmentation strategies **beyond computer vision**?
- What about .stress[structured time series]?

---

class: center, middle, inverse

# Thank You!