Leveraging text representations for clinical predictive tasks

Tristan Naumann

PhD Thesis: Massachusetts Institute of Technology | June 2018

The increasing prevalence of digitized clinical data creates new opportunities to use machine learning to unlock clinical insights, and ultimately improve healthcare delivery. However, while data from Electronic Health Records (EHRs) have become common, they present unique challenges. Clinical data are noisy, sparse, irregularly sampled, and often biased in their recording of health state and care patterns. Further, much of the most important information used by care staff is recorded in unstructured text notes that are not easily deciphered by non-experts. In this work, we present machine learning methods that distill large amounts of text-based clinical data into latent representations. These representations are then used to predict a variety of important outcomes. In particular, we focus on prediction tasks that can provide evidence-based risk assessment and forecasting in settings with guidelines that have not traditionally been data-driven. We consider several abstractions for clinical narrative text, and evaluate their utility on common predictive tasks, such as mortality and readmission. We argue that a “good” representation will improve performance on these tasks and that multiple representations may be necessary, as different models excel on different tasks. We present three case studies in which we use representations of clinical text to improve performance on clinical prediction tasks. First, we augment predictive models that use baseline clinical features by including features from clinical progress notes [31]. These features are derived using Latent Dirichlet Allocation (LDA) and incorporated using per-patient topic membership…