Unsupervised Machine Learning for the Discovery of Latent Disease Clusters and Patient Subgroups Using Electronic Health Records

by Yanshan Wang, et al.

Machine learning has become a ubiquitous and key technology for mining electronic health records (EHRs) to facilitate clinical research and practice. Unsupervised machine learning, as opposed to supervised learning, has shown promise in identifying novel patterns and relations from EHRs without using human-created labels. In this paper, we investigate the application of unsupervised machine learning models to discovering latent disease clusters and patient subgroups based on EHRs. We utilized Latent Dirichlet Allocation (LDA), a generative probabilistic model, and proposed a novel model named the Poisson Dirichlet Model (PDM), which extends the LDA approach by using a Poisson distribution to model patients' disease diagnoses and alleviates the effects of age and sex by considering both observed and expected observations. In the empirical experiments, we evaluated LDA and PDM on three patient cohorts with EHR data retrieved from the Rochester Epidemiology Project (REP) for the discovery of latent disease clusters and patient subgroups. We compared the effectiveness of LDA and PDM in identifying latent disease clusters by visualizing the disease representations learned by the two approaches. We also tested the performance of LDA and PDM in differentiating patient subgroups through survival analysis and statistical analysis. The experimental results show that the proposed PDM could effectively identify distinct disease clusters by alleviating the impact of age and sex, and that LDA could stratify patients into more differentiable subgroups than PDM in terms of p-values. However, because the effects of age and sex are alleviated, the subgroups discovered by PDM might reflect underlying disease patterns of greater interest in epidemiology research. Both unsupervised machine learning approaches could be leveraged to discover patient subgroups using EHRs, but with different foci.
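The abstract says PDM considers both observed and expected observations to alleviate age and sex effects, but it does not give the formulation. As a rough illustration of that general idea only, and not the PDM itself, the sketch below compares each patient's observed diagnoses with the prevalence expected in their age-sex stratum. The function names, the record layout, and the toy cohort are all hypothetical.

```python
from collections import Counter, defaultdict


def stratum_prevalence(patients):
    """Fraction of patients in each age-sex stratum who carry each diagnosis."""
    sizes = Counter(p["stratum"] for p in patients)
    counts = defaultdict(Counter)
    for p in patients:
        # Each patient contributes at most one count per diagnosis.
        counts[p["stratum"]].update(p["diagnoses"])
    return {
        stratum: {dx: n / sizes[stratum] for dx, n in dx_counts.items()}
        for stratum, dx_counts in counts.items()
    }


def observed_minus_expected(patient, prevalence):
    """Observed indicator (1 or 0) minus the stratum's expected prevalence."""
    expected = prevalence.get(patient["stratum"], {})
    return {
        dx: (1.0 if dx in patient["diagnoses"] else 0.0) - rate
        for dx, rate in expected.items()
    }


# Hypothetical toy cohort; strata and diagnoses are made up for illustration.
cohort = [
    {"id": 1, "stratum": ("F", "60+"), "diagnoses": {"hypertension", "arthritis"}},
    {"id": 2, "stratum": ("F", "60+"), "diagnoses": {"hypertension"}},
    {"id": 3, "stratum": ("M", "18-39"), "diagnoses": {"asthma"}},
    {"id": 4, "stratum": ("M", "18-39"), "diagnoses": set()},
]

prevalence = stratum_prevalence(cohort)
for patient in cohort:
    print(patient["id"], observed_minus_expected(patient, prevalence))
```

In this toy cohort, hypertension is present in every patient of the older female stratum, so it carries no excess signal for those patients, while arthritis does. A clustering model fed such adjusted values would, under these assumptions, be less likely to group diseases merely because they share an age or sex profile.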




