“Fake Patients” May Answer Demand for Big Health Data

Simulated EHRs could safeguard and expand data sharing.

Machine learning has brought unprecedented power to mine medical insights from large sets of electronic health records (EHRs). However, as recent headlines attest, this new frontier – in which hospital systems are tasked with housing, protecting and ethically sharing huge repositories of EHRs – is fraught with ongoing risk.

A solution may lie in new technology to generate “fake patients” using simulated EHRs, says Bradley Malin, Ph.D., vice chair for research affairs in the Department of Biomedical Informatics at Vanderbilt University Medical Center. Generative adversarial network (GAN) technology, a type of machine learning, has been lauded for some time as a means of simulating patients for large-scale research.

“With GAN machine learning, no one record matches up with a real person, but the records are ‘trained’ through many iterations to yield the same predictive power as the real records would,” Malin explained. “If you are using the data for training machine learning methodology, it is clearly the way to go.”

A Game of Approximation

GANs operate as a sort of cat-and-mouse game between two models: a generator produces synthetic records, and a discriminator is continually challenged to tell what is real from what is machine-generated. With enough iterations, and with the generator adjusted after each round, the divergence between real and simulated records narrows until it is within the desired range.
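The opposing objectives in this game can be made concrete with a minimal Python sketch. It assumes the standard GAN setup, in which a generator produces synthetic records and a discriminator scores each record with a probability of being real. The function names and the sample scores are illustrative and are not taken from the study.

```python
import math

def discriminator_loss(real_scores, fake_scores):
    """Cross-entropy loss: low when real records score near 1 and fakes near 0."""
    real_term = -sum(math.log(s) for s in real_scores) / len(real_scores)
    fake_term = -sum(math.log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def generator_loss(fake_scores):
    """Non-saturating generator loss: low when fakes are scored as real."""
    return -sum(math.log(s) for s in fake_scores) / len(fake_scores)

# Hypothetical discriminator scores (probability that a record is real).
early_real, early_fake = [0.95, 0.90], [0.05, 0.10]  # fakes are easy to spot
late_real, late_fake = [0.5, 0.5], [0.5, 0.5]        # fakes are indistinguishable

early_d = discriminator_loss(early_real, early_fake)
late_d = discriminator_loss(late_real, late_fake)
early_g = generator_loss(early_fake)
late_g = generator_loss(late_fake)

print(f"discriminator loss: early={early_d:.3f}, late={late_d:.3f}")
print(f"generator loss:     early={early_g:.3f}, late={late_g:.3f}")
```

When the discriminator can do no better than a coin flip (a score of 0.5 for every record), its loss settles at 2 ln 2, roughly 1.386. At that point this particular discriminator can no longer separate synthetic records from real ones.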

To date, the use of GANs to simulate EHR data has been slowed by the lack of a principled approach or evaluation model. In a new study published in JAMIA, Malin and his team, led by Vanderbilt computer science doctoral students Ziqi Zhang and Chao Yan, introduced an improved pipeline for EHR generation. They successfully tested it using Vanderbilt’s pioneering Synthetic Derivative database of de-identified patient records.

Improving GANs

The new EHR generator employs a GAN with several innovations – a “Wasserstein divergence” and “layer normalization techniques” – that Malin’s team tested against existing GANs, using billing codes from over a million EHRs. “No model has been tested on this scale before,” Malin said. “It outperformed the state-of-the-art approaches, with significant improvement in retaining the nature of real records – including prediction performance and structural properties – without sacrificing privacy.”
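Of the two techniques named above, layer normalization is the simpler to illustrate. The sketch below shows only the generic operation: rescaling one layer's activations for a single record to zero mean and roughly unit variance. The function name, the `eps` constant and the sample values are assumptions for illustration; how the study's pipeline applies the technique is not described here.

```python
import math

def layer_norm(activations, eps=1e-5):
    """Normalize one record's activations to zero mean and near-unit variance."""
    n = len(activations)
    mean = sum(activations) / n
    var = sum((a - mean) ** 2 for a in activations) / n
    # eps guards against division by zero when all activations are equal.
    return [(a - mean) / math.sqrt(var + eps) for a in activations]

hidden = [2.0, 4.0, 6.0, 8.0]  # hypothetical hidden-layer values for one record
normalized = layer_norm(hidden)
print([round(v, 3) for v in normalized])
```

Full implementations usually also apply a learned per-feature scale and shift after this step, which is omitted here for brevity.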

Malin points out that the current simulations have limitations. “GANs are only as good as the information you have,” he said. “We can’t take out biases that exist due to errors in the way the record is built.” Additionally, when the research goal is to investigate rare diseases or other anomalies, the process of “rounding off the edges” may produce synthetic records that do not align closely enough with the target patient population, he said.

Safer Data Sharing

The Office for Civil Rights at the U.S. Department of Health and Human Services oversees HIPAA compliance for EHRs. Any charge that an institution has exposed private health information becomes a matter for civil litigation or, in egregious cases, investigation and official censure. The steps required to mitigate liability when sharing real patient data are complex and costly.

Malin’s model could eliminate this liability by generating and sharing data with the research value of real EHRs without exposing actual records. “We share real records – just not on real people,” he said.

Malin is currently working to transition this patient simulation methodology to work within All of Us, a national program led by the NIH that is collecting data sets on a million people across multiple hospital systems.