Generative AI for synthetic data: A framework to expand health data reach for research and ensure algorithmic fairness

NIH RePORTER · NIH · K99 · $89,209

Abstract

Dr. Yan received his Ph.D. in computer science with a focus on developing privacy-preserving technologies to defend information systems against data breaches. His growing enthusiasm for maximizing the potential of AI/ML to deepen the understanding of patient data has oriented him toward a career in biomedical informatics. Throughout this transition, Dr. Yan has been dedicated to devising new methodologies to optimize the utility of patient data and encourage its ethical use. Specifically, Dr. Yan has developed and evaluated multiple generative AI algorithms to produce high-fidelity and privacy-respecting synthetic health data for critical downstream tasks.

Over the past several years, synthetic electronic health record (EHR) data generation powered by generative AI technologies has gained substantial attention in the health domain due to its ability to protect privacy, promote data sharing, and improve the performance of medical AI by providing datasets with greater size, diversity, and population representativeness. Despite this promise, there are significant gaps between the current state of the technology and its full potential: the evaluation of synthetic EHR data is subpar; its development can neither leverage data sources privately owned by multiple institutions nor integrate established medical knowledge (which largely limits the quality of synthetically generated EHR data); and it is unknown how to use synthetic data to produce fair and reliable medical AI.

This proposal aims to innovate computational methods, realized in open-source software, to enable the assessment, development, and utilization of synthetic health data. Aim 1 focuses on the development of a multi-dimensional, customizable evaluation framework that appraises synthetic health data in terms of its utility (i.e., value as an analytical resource), privacy, and fairness (i.e., ability to preserve the subgroup representativeness found in real data). This framework will further inform synthetic data creators of the appropriateness of a data generation model for a particular use case. Aim 2 will develop an ML architecture that allows multiple institutions to collaboratively train knowledge-integrated synthetic health data generation models using privately held datasets without data sharing. Aim 3 focuses on the development of a fairness-aware pipeline that 1) utilizes synthetic data to balance the representativeness of subpopulations, 2) embeds fairness constraints into the model training process, and 3) is agnostic to the underlying AI/ML models.

With the assistance of a multidisciplinary mentoring team from VUMC, Penn Medicine, and Weill Cornell Medicine, Dr. Yan will leverage EHR data from these institutions, as well as the All of Us and MIMIC EHR data, to develop and evaluate the proposed computational methods. Dr. Yan will expand his expertise through training in fairness design, federated learning algorithms, biomedical informatics and statistical methods...

Key facts

NIH application ID: 10984266
Project number: 1K99LM014428-01A1
Recipient: VANDERBILT UNIVERSITY MEDICAL CENTER
Principal Investigator: Chao Yan
Activity code: K99
Funding institute: NIH
Fiscal year: 2024
Award amount: $89,209
Award type: 1
Project period: 2024-09-01 → 2026-08-31