Machine learning is now a proven tool for automating the discovery of disease characteristics from large sets of electronic health records (EHRs). Yet most research efforts to date have ignored how diseases progress and have instead modeled them as discrete events. Uncovering how disease phenotypes evolve and contribute to health outcomes over time may be the next step toward better defining disease progression.
In a study published in the Journal of Biomedical Informatics, a team of researchers at Vanderbilt University Medical Center has used tensor factorization, an extension of matrix factorization to multi-dimensional data that can model dynamic disease processes, to characterize the complexity of a cardiovascular disease (CVD) patient cohort based on longitudinal EHR data.
“Cardiovascular disease is complex and is affected by a variety of factors – including comorbidities and medications,” said Juan Zhao, Ph.D., a postdoctoral research fellow in biomedical informatics at Vanderbilt Center for Precision Medicine, and lead author on the study. “The effects typically evolve and progress over a long period of time.”
The researchers identified 12,380 adult CVD patients and extracted 1,068 unique ICD codes among them. Using tensor factorization modeling, they then identified a set of “subphenotypes” that presented in these patients over the 10 years prior to the first diagnosis of CVD and showed common progression patterns.
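To make the idea concrete, here is a minimal sketch of how a patient-by-diagnosis-code-by-year tensor might be factorized. It is an illustration only: the article does not say which factorization variant the team used, so the plain CP (CANDECOMP/PARAFAC) model fitted with alternating least squares, the `cp_als` function, the toy tensor sizes and the two made-up patient groups are all assumptions, not the study's method or data.

```python
import numpy as np


def cp_als(X, rank, n_iter=100, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r] by alternating least squares.

    Hypothetical helper for illustration; not the study's implementation.
    """
    rng = np.random.default_rng(seed)
    n_i, n_j, n_k = X.shape
    A = rng.random((n_i, rank))  # patient loadings
    B = rng.random((n_j, rank))  # diagnosis-code loadings
    C = rng.random((n_k, rank))  # time loadings
    norm_x = np.linalg.norm(X)
    errors = []
    for _ in range(n_iter):
        # Each update is a least-squares solve for one factor, holding the other two fixed.
        A = np.einsum("ijk,jr,kr->ir", X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum("ijk,ir,kr->jr", X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum("ijk,ir,jr->kr", X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
        X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
        errors.append(np.linalg.norm(X - X_hat) / norm_x)  # relative reconstruction error
    return A, B, C, errors


# Toy data (made up): 30 patients, 8 diagnosis codes, 10 years before first CVD diagnosis.
rng = np.random.default_rng(42)
n_patients, n_codes, n_years = 30, 8, 10
X = np.zeros((n_patients, n_codes, n_years))
X[:15, :4, :] = np.linspace(0.1, 1.0, n_years)  # group 1: codes 0-3, rising over time
X[15:, 4:, :] = np.full(n_years, 0.5)           # group 2: codes 4-7, flat over time
X += 0.01 * rng.random(X.shape)                 # small noise

A, B, C, errors = cp_als(X, rank=2)
print("relative error after fitting:", errors[-1])
print("time profile of each component:\n", C.round(2))
```

In this sketch each of the two components plays the role of a candidate subphenotype: the columns of `B` indicate which codes belong together, the columns of `C` show how that group of codes changes over the years, and the columns of `A` show how strongly each patient expresses it.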
Subphenotypes associated with CVD that emerged included hypercholesterolemia, mixed hyperlipidemia, essential hypertension, type 2 diabetes, chronic airway obstruction, diabetic retinopathy and end-stage renal disease – all well-known risk factors and comorbidities. Less-expected subphenotypes included depression, vitamin D deficiency, hypothyroidism, actinic keratosis, migraine, allergic rhinitis and urinary tract infection. Three reviewers with backgrounds in clinical practice or medical training validated the results to ensure clinical relevance.
“In this experiment, the algorithms identify the standout subphenotypes and present us with trends and characteristics. These findings potentially enhance our knowledge of the mechanisms of complex diseases like CVD, and hopefully accelerate the process of bringing precision medicine into routine clinical care,” said Wei-Qi Wei, M.D., assistant professor of bioinformatics at Vanderbilt and senior/corresponding author on the paper.
Risk of Adverse CVD Outcomes
For each identified subphenotype, the researchers examined the associated risk of adverse cardiovascular outcomes as estimated by the American College of Cardiology/American Heart Association Pooled Cohort Risk Equations. Combining association analysis with estimated CVD risk, they found that some subphenotypes, such as vitamin D deficiency, depression and urinary tract infection, could not be explained by conventional risk factors. This suggests that established risk assessment tools fail to accurately model the risk of CVD for patients with these conditions, Zhao said.
“Results from the analysis showed that some subphenotypes were not correlated with conventional CVD risk factors,” said Zhao. “This suggests that individuals with these subphenotypes may be affected by diseases with different pathophysiologic causes.”
Using survival analysis, the team also compared subsequent myocardial infarction (MI) rates in patients diagnosed with CVD. They found markedly different rates among the six most common subphenotypes, indicating that patients in these groups may have clinically meaningful and distinct MI risk.
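As an illustration of what such a comparison can look like, the sketch below computes Kaplan-Meier survival curves for two groups of patients. The article does not specify which survival method the team used, so the Kaplan-Meier estimator is an assumption here, and the `kaplan_meier` function, the group names and the follow-up numbers are all hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time where an event occurred.

    durations: follow-up time for each patient (for example, years after CVD diagnosis)
    events:    1 if an MI was observed at that time, 0 if follow-up ended without one
    Hypothetical helper for illustration; not the study's implementation.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        n_events = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        n_leaving = sum(1 for d in durations if d == t)
        if n_events > 0:
            # Multiply by the fraction of at-risk patients who did not have an MI at time t.
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        # Patients with an event or censored at t are no longer at risk afterwards.
        at_risk -= n_leaving
    return curve


# Made-up follow-up data for two hypothetical subphenotype groups.
group_a = kaplan_meier(durations=[1, 2, 2, 3, 4], events=[1, 1, 0, 1, 0])
group_b = kaplan_meier(durations=[2, 3, 5, 5, 6], events=[0, 1, 0, 1, 0])

for name, curve in (("group A", group_a), ("group B", group_b)):
    for t, s in curve:
        print(f"{name}: MI-free probability at year {t} = {s:.2f}")
```

Curves that separate clearly between groups, as they do in this toy example, are the kind of pattern that would point to distinct MI risk; a formal comparison would normally add a statistical test such as the log-rank test.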
The researchers are now working to incorporate genetic information into their model and to apply it to other diseases. Said Zhao, “At Vanderbilt, we not only have the longitudinal data from our EHRs, we also have BioVU, our genomic database, and phenome-wide association studies that will help us apply the model to more varieties of data such as genetic interactions, drugs, and previously existing studies.”
The team hopes their model will advance the use of precision medicine in basic and clinical research. Said Wei, “Machine learning holds the promise of discovering hidden structures within complex EHRs.”
Added Zhao, “Perhaps it will help clinicians understand disease prevalence and probabilities. In addition, if we can identify the subphenotype pathways in complex diseases, we can optimize treatments.”