class: middle, center, title-slide # Deep Learning for Electronic Health Records ## DU IA appliqué à la santé
**Stéphane Gaïffas**
[https://stephanegaiffas.github.io](https://stephanegaiffas.github.io)
.grid[ .kol-1-4[.width-55[![](figures/up.svg)]] .kol-1-4[.width-100[![](figures/lpsm.svg)]] .kol-1-4[.width-45[![](figures/ens.svg)]] .kol-1-4[.width-100[![](figures/prairie.svg)]] ] --- class: end-slide, center, middle # 1. Deep learning in healthcare --- # Deep learning in healthcare Is mostly about **computer vision** and **natural language processing** .center[
] --- # Deep learning in healthcare ## Computer vision .center[
] .footnote[From: "Dermatologist-level classification of skin cancer with deep neural networks" (https://www.nature.com/articles/nature21056.epdf)] --- # Deep learning in healthcare ## Computer vision .center[
] .footnote[From: "Dermatologist-level classification of skin cancer with deep neural networks" (https://www.nature.com/articles/nature21056.epdf)] --- # Deep learning in healthcare ## Computer vision .center[
] .footnote[From: "Dermatologist-level classification of skin cancer with deep neural networks" (https://www.nature.com/articles/nature21056.epdf)] --- # Deep learning in healthcare ## Computer vision .center[
] .footnote[From: "Clinically applicable deep learning for diagnosis and referral in retinal disease" (https://www.nature.com/articles/s41591-018-0107-6?sf195443527=1)] --- # Deep learning in healthcare ## Computer vision .center[
] .footnote[From: "Clinically applicable deep learning for diagnosis and referral in retinal disease" (https://www.nature.com/articles/s41591-018-0107-6?sf195443527=1)] --- # Deep learning in healthcare ## Natural Language Processing To cite but a **few**... - Chen L. et al, *Clinical trial cohort selection based on multi-level rule-based natural language processing system.*, J Am Med Inform Assoc. 2019 - Wang Y. et al, *Clinical information extraction applications: a literature review.* J Biomed Inform. 2018; 77: 34-49" - Sohn S., *Ascertainment of asthma prognosis using natural language processing from electronic medical records*. J Allergy Clin Immunol. 2018 - Sohn S. et al., *Analysis of clinical variations in asthma care documented in electronic health records between staff and resident physicians*. Stud Health Technol Inform. 2017 - Liang H. et al., *Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence*, Nat Med. 2019 --- class: end-slide, center, middle # 2. Beyond CV and NLP for healthcare --- # Deep learning for EHRs EHRs = Electronic Healthcare Records
.center[
] .footnote[From: "Benchmarking deep learning models on large healthcare datasets" (https://www.sciencedirect.com/science/article/pii/S1532046418300716)] --- # Deep learning for EHRs What is the **state-of-the-art** ? .center[
] .footnote[From: "Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data" (https://arxiv.org/pdf/2010.01149.pdf)] --- # Deep learning for EHRs What is the **state-of-the-art** ? .center[
] .footnote[From: "Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data" (https://arxiv.org/pdf/2010.01149.pdf)] --- # Deep learning for EHRs What is the **state-of-the-art** ? .center[
]
.footnote[From: "Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data" (https://arxiv.org/pdf/2010.01149.pdf)] --- class: end-slide, middle, center # 3. Self-advertisements --- # The X-CNAM project - CNAM = **Caisse Primaire d'Assurance Maladie** - With E. Bacry (CNRS, Univ. Paris Dauphine) - A partnership between X and CNAM - Data from **carte vitale** (almost all French citizens, many TeraBytes) .center[
] --- # The X-CNAM project - CNAM = **Caisse Nationale d'Assurance Maladie** - With E. Bacry (CNRS, Univ. Paris Dauphine) - A partnership between X (École polytechnique) and CNAM - Data from **carte vitale** (almost all French citizens, many terabytes) ### Papers - **SCALPEL3: a scalable open-source library for healthcare claims databases**, *International Journal of Medical Informatics*, 2020 - **Screening anxiolytics, hypnotics, antidepressants and neuroleptics for bone fracture risk among elderly: a nation-wide dynamic multivariate self-control study using the SNDS claims database**, *submitted to Drug Safety* - **ConvSCCS: convolutional self-controlled case series model for lagged adverse event detection**, *Biostatistics*, 2019 --- # Prairie 3IA Chair "PERHAPS" - **PERHAPS** = deeP learning for ElectRonic HeAlth records and applications in ProStatic pathologies - **Syndrome Métabolique et Pathologies Prostatiques** (with urologists from Tenon Hospital) .center[
] --- # Prairie 3IA Chair "PERHAPS" - **PERHAPS** = deeP learning for ElectRonic HeAlth records and applications in ProStatic pathologies - **Syndrome Métabolique et Pathologies Prostatiques** (with urologists from Tenon Hospital) ### Papers - **ZiMM: a deep learning model for long term adverse events with non-clinical claims data**, *NeurIPS 2019, ML4H workshop* and *Journal of Biomedical Informatics*, 2020 - **Which attention model and unsupervised pretraining strategy for electronic health records?**, *submitted* - **About contrastive unsupervised representation learning for classification and its convergence**, *submitted* --- # General-purpose DL techniques We'll describe **recent DL techniques** that can be **very useful** for EHRs **Architectures** involving **attention mechanisms** - *Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention* (https://arxiv.org/abs/2006.16236) - *Rethinking Attention with Performers* (https://arxiv.org/abs/2009.14794) **Self-supervised learning** based on **contrastive learning** - *A Simple Framework for Contrastive Learning of Visual Representations* (https://arxiv.org/abs/2002.05709) - *Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning* (https://arxiv.org/abs/2006.07733) --- class: end-slide, center, middle # 4. Attention-based architectures --- # From RNNs to transformers - RNNs used to be the **state of the art** for **machine translation**, **time series analysis**, and more generally any **sequence-to-sequence task** - Then, **attention** was used **inside recurrent layers** to improve how they handle **long-range dependencies** - But RNNs are **hard to scale**: their **recurrent** nature **hinders parallel computation** A **game changer** came in 2017: - **Attention is all you need** by Vaswani et al. (2017) introduces the **transformer** architecture (https://arxiv.org/abs/1706.03762) - Many follow-ups since then... - The core ingredient is the **Multi-Head Self-Attention** layer This led to models such as - BERT, Transformer-XL, GPT-3 (175 billion parameters!) --- # GPT-3 examples
.center[
] .center[ .small[https://app.inferkit.com/demo] ] --- # Self-attention layer - For the first layer, **input** is a sequence of **token embeddings** $$\mathbf X = [x\_1, \ldots, x\_L]$$ where $x\_i \in \mathbb R^d$ - **Output** is a same-length sequence of (hopefully) **contextualized embeddings** - For other layers, **input** is a sequence of **contextualized embedding vectors** (output of a previous self-attention layer) The next figures are from - http://jalammar.github.io/illustrated-transformer --- # Self-attention layer It first computes **keys**, **queries** and **values**: $$ \mathbf{Q} = \mathbf{X W}^Q, \quad \mathbf{K} = \mathbf{X W}^K, \quad \mathbf{V} = \mathbf{X W}^V $$ where $$ \mathbf{W}^Q \in \mathbb R^{d \times d\_k}, \quad \mathbf{W}^K \in \mathbb R^{d \times d\_k} \quad \text{and} \quad \mathbf{W}^V \in \mathbb R^{d \times d\_v} $$ are learned parameters and $\mathbf{X} \in \mathbb R^{L \times d}$ is the input .center[
] --- # Self-attention layer It then computes **inner products** between **queries** and **keys** and applies a **softmax over rows** $$ \mathbf{Z} = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d\_k}} \right) \mathbf{V} $$ .center[
] --- # Multi-head self-attention layer Combines $H$ **heads** of **self-attention** $$ \mathbf{Q}\_h = \mathbf{X W}\_h^Q, \quad \mathbf{K}\_h = \mathbf{X W}\_h^K, \quad \mathbf{V}\_h = \mathbf{X W}\_h^V $$ $$ \mathbf{Z}\_h = \text{softmax}\left( \frac{\mathbf{Q}\_h \mathbf{K}\_h^\top}{\sqrt{d\_k}} \right) \mathbf{V}\_h $$ $$ \text{MSA}(\mathbf{X}) = [\mathbf{Z}\_1 \; \cdots \; \mathbf{Z}\_H] \; \mathbf{W}^O $$ .center[
] --- # Multi-head self-attention layer **Visualization** of the **softmax matrix** .center[
] .center[
] This solves, among many other things, **coreference resolution** (a difficult problem in machine translation) .footnote[From https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html] --- # Transformer architecture The **encoder** stacks several MSA layers as follows: $$ \mathbf{Y}\_k = \text{LayerNorm}(\mathbf{X}\_k + \text{MSA}(\mathbf{X}\_k)) $$ $$ \mathbf{X}\_{k+1} = \text{LayerNorm}(\mathbf{Y}\_k + \text{FF}(\mathbf{Y}\_k)) $$ where - $\text{FF}$ is a **dense** or **feed-forward** network - $\mathbf{X}\_1 = \mathbf{e}$ is the sequence of $L$ token embeddings - $\mathbf{X}\_k \in \mathbb R^{L \times d\_k}$ is the input of the $k$-th layer - $\mathbf{X}\_{k+1} \in \mathbb R^{L \times d\_k}$ is the output of the $k$-th layer --- # Transformer architecture **Layer normalization** versus **batch normalization** .center[
] --- # Transformer architecture Usually uses an **encoder / decoder architecture** .center[
] --- # Transformer architecture Usually uses an **encoder / decoder architecture** .center[
] .footnote[From https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html] --- # Positional embeddings - As such, **token embeddings** do not change with their **position** in the sequence - A strategy is **positional embeddings**: either **deterministic** or **trained** - Just add each **positional embedding** to each **token embedding** before feeding the tensor into the architecture - The original implementation uses **512 cosines and sines** - An example with only 2 cosines and sines: .center[
] --- # Quadratic complexity of MSA - The MSA layer has **memory and computational complexity** $O(L^2 d)$ (see the sketch below) - It demands huge computational power and saturates GPU memory for **long sequences** (large $L$) - Some **follow-up works** propose strategies to address this
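Below is a minimal NumPy sketch of the MSA layer defined on the previous slides (function names and toy dimensions are illustrative, not taken from any particular library). It materializes an $L \times L$ score matrix per head, which is exactly where the $O(L^2 d)$ memory and compute cost comes from.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax, stabilized by subtracting the row maximum
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # X: (L, d), Q and K: (L, d_k), V: (L, d_v)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # (L, L) score matrix: the O(L^2) bottleneck
    return A @ V                         # (L, d_v)

def multi_head_self_attention(X, heads, W_O):
    # heads is a list of (W_Q, W_K, W_V) triplets; head outputs are concatenated
    Z = np.concatenate([self_attention(X, *head) for head in heads], axis=-1)
    return Z @ W_O                       # back to (L, d)

# Toy dimensions: sequence length L, embedding size d, H heads of size d_k
rng = np.random.default_rng(0)
L, d, H, d_k = 6, 16, 2, 8
X = rng.normal(size=(L, d))
heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(H)]
W_O = rng.normal(size=(H * d_k, d))
print(multi_head_self_attention(X, heads, W_O).shape)  # (6, 16)
```

The follow-up works listed on the next slide essentially avoid materializing this $L \times L$ matrix, either by restricting attention to a neighborhood (graph attention) or by linearizing the softmax (linear transformers, Performers).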
.center[
] --- # Quadratic complexity of MSA - The MSA layer has **memory and computational complexity** $O(L^2 d)$ - It demands huge computational power and saturates GPU memory for **long sequences** (large $L$) - Some **follow-up works** propose strategies to address this **Graph Neural Networks** and **Graph Attention Networks** - *Graph Attention Networks* (https://arxiv.org/abs/1710.10903) - *Graph Neural Networks: A Review of Methods and Applications* (https://arxiv.org/abs/1812.08434) **Linear transformers** - *Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention* (https://arxiv.org/abs/2006.16236) - *Rethinking Attention with Performers* (https://arxiv.org/abs/2009.14794) --- class: end-slide, center, middle # 5. Application to healthcare pathways --- # Healthcare pathways .center[
] --- # Healthcare pathways **Encoder-only** architectures .center[
] --- # Healthcare pathways This allows us to **improve on strong baselines** .center[
] .footnote[From: "ZiMM: a deep learning model for long term adverse events with non-clinical claims data"] --- class: end-slide, center, middle # 6. Self-supervised learning --- # Self-supervised learning Use **pretext tasks** hence the name **self-supervised**. For **NLP** a strategy called **Masked Language Modeling** does the following: - Selects **15% of the tokens at random** in a sequence - among them, **replace 80% by the `MASK` token**, **10% by a random code** and leave the **remaining 10% unchanged** - **Predict the token** hidden behind the `MASK` token One of the core ingredient of BERT - *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding* (https://arxiv.org/abs/1810.04805) Other strategies involve **sequence order** prediction, among others --- # Self-supervised learning Recent **very impressive results** in **computer vision** .center[
] Let's explain "A Simple Framework for Contrastive Learning of Visual Representations" (https://arxiv.org/abs/2002.05709) .footnote[From https://arxiv.org/abs/2006.07733] --- # Self-supervised learning The **main ingredients** for **self-supervised learning** (SimCLR version) - A **stochastic data augmentation** module. It transforms each input $x$ into two randomly augmented versions $\tilde x\_i$ and $\tilde x\_j$. The pair $(\tilde x\_i, \tilde x\_j)$ is called a **positive pair**. .center[
] .footnote[From https://arxiv.org/abs/2002.05709] --- # Self-supervised learning The **main ingredients** for **self-supervised learning** (SimCLR version) - An **encoder** $f(\cdot)$ (for instance a ResNet50) that we want to train. We use it to compute $h\_i = f(\tilde x\_i)$ and $h\_j = f(\tilde x\_j)$ - A **projection head** $g(\cdot)$ given by a simple feed-forward network, such as a one-hidden-layer network $$ z\_i = g(h\_i) = \mathbf W^{(2)} \; \text{ReLU}( \mathbf W^{(1)} \; h\_i) $$ - Create **data-augmented pairs** $\\{ \tilde x\_k \\}\_{k=1, \ldots, 2N}$ from a mini-batch of size $N$. On a positive pair $(i, j)$ we compute the **contrastive loss** $$ \ell(i, j) = -\log \left( \frac{e^{\text{sim}(z\_i, z\_j) / \tau}}{\sum\_{k=1}^{2N} \mathbf 1\_{k \neq i} e^{\text{sim}(z\_i, z\_k) / \tau}} \right) $$ where $\text{sim}(u, v) = u^\top v / \\| u \\| \\| v \\|$ is the cosine similarity --- # Self-supervised learning The **main ingredients** for **self-supervised learning** (SimCLR version) - The loss on the data-augmented mini-batch $\\{ \tilde x\_k \\}\_{k=1, \ldots, 2N}$ is given by $ \frac{1}{2N} \sum\_{k=1}^N \left( \ell(2 k - 1, 2 k) + \ell(2 k, 2 k - 1) \right) $ (see the sketch below) .center[
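For concreteness, here is a small NumPy sketch of this mini-batch loss (the function name, the ordering convention and the toy dimensions are assumptions made here, not taken from the paper); it assumes the two augmented views of example $k$ are stored in consecutive rows of `Z`.

```python
import numpy as np

def nt_xent_loss(Z, tau=0.5):
    # Z: (2N, p) projections z_k; rows (2k, 2k+1) are assumed to form a positive pair
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # cosine similarity = dot product of unit vectors
    S = Z @ Z.T / tau                                 # (2N, 2N) matrix of sim(z_i, z_k) / tau
    np.fill_diagonal(S, -np.inf)                      # the indicator 1_{k != i}: drop self-similarity
    log_den = np.log(np.exp(S).sum(axis=1))           # log of the denominator for each anchor i
    N = Z.shape[0] // 2
    loss = 0.0
    for k in range(N):
        i, j = 2 * k, 2 * k + 1
        loss += (log_den[i] - S[i, j]) + (log_den[j] - S[j, i])  # l(i, j) + l(j, i)
    return loss / (2 * N)

# Toy usage: 2N = 8 projected views of dimension 32
rng = np.random.default_rng(0)
print(nt_xent_loss(rng.normal(size=(8, 32))))
```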
] .footnote[From https://arxiv.org/abs/2002.05709] --- # Self-supervised learning .center[
] .footnote[From https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html] --- # Self-supervised learning - This version of self-supervised learning requires **large mini-batches** - So that **enough negatives** are used in the contrastive loss - This implies a **large computational** and **memory** footprint A convincing **alternative approach** is: - *Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning* (https://arxiv.org/abs/2006.07733) --- # Conclusion Some **mostly open** topics in deep learning applied to EHRs: - **Self-supervised learning** for EHRs? - In particular, how to perform **data augmentation**? - The impact of the **sparsity** and **irregularity** of healthcare pathways on general-purpose models (RNNs, transformers with positional encoding) --- class: end-slide, center, middle # Thank you!