# Generative AI for synthetic data: A framework to expand health data reach for research and ensure algorithmic fairness

> **NIH K99** · VANDERBILT UNIVERSITY MEDICAL CENTER · 2024 · $89,209

## Abstract

Dr. Yan received his Ph.D. in computer science with a focus on developing privacy-preserving technologies to
defend information systems against data breaches. His growing enthusiasm for maximizing the potential of AI/ML
to deepen the understanding of patient data has oriented him toward a career in biomedical informatics.
Throughout this transition, Dr. Yan has been dedicated to devising new methodologies to optimize the utility of
patient data and encourage its ethical use. Specifically, Dr. Yan has developed and evaluated multiple generative
AI algorithms to produce high-fidelity and privacy-respecting synthetic health data for critical downstream tasks.
Over the past several years, synthetic electronic health record (EHR) data generation powered by generative AI
technologies has gained substantial attention in the health domain due to its ability to protect privacy, promote
data sharing, and improve the performance of medical AI by providing datasets with greater size, diversity, and
population representativeness. Despite its great potential, there are significant gaps between the current state
of this technology and its maximal worth: the evaluation of synthetic EHR data is subpar, its development cannot
leverage data sources privately owned by multiple institutions or integrate established medical knowledge
(which largely limits the quality of synthetically generated EHR data), and it is unknown how to use synthetic
data to produce fair and reliable medical AI. This proposal aims to innovate computational methods, realized in
open-source software, to enable the assessment, development, and utilization of synthetic health data. Aim 1
focuses on the development of a multi-dimensional, customizable evaluation framework that appraises synthetic
health data in terms of its utility (i.e., value as an analytical resource), privacy, and fairness (i.e., ability to preserve
subgroup representativeness in real data). This framework will further inform synthetic data creators of the
appropriateness of a data generation model for a particular use case. Aim 2 will develop an ML architecture that
allows multiple institutions to collaboratively train knowledge-integrated synthetic health data generation models
using privately held datasets without data sharing. Aim 3 focuses on the development of a fairness-aware
pipeline that 1) utilizes synthetic data to balance the representativeness of subpopulations, 2) embeds fairness
constraints into the model training process, and 3) is agnostic to the AI/ML models relied upon. With the
assistance of a multidisciplinary mentoring team from VUMC, Penn Medicine, and Weill Cornell Medicine, Dr.
Yan will leverage EHR data from these institutions, as well as the All of Us and MIMIC EHR data, to develop and
evaluate the proposed computational methods. Dr. Yan will expand his expertise through training in fairness
design, federated learning algorithms, biomedical informatics and statistical methods...

## Key facts

- **NIH application ID:** 10984266
- **Project number:** 1K99LM014428-01A1
- **Recipient organization:** VANDERBILT UNIVERSITY MEDICAL CENTER
- **Principal Investigator:** Chao Yan
- **Activity code:** K99
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $89,209
- **Award type:** 1
- **Project period:** 2024-09-01 → 2026-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10984266

## Citation

> US National Institutes of Health, RePORTER application 10984266, Generative AI for synthetic data: A framework to expand health data reach for research and ensure algorithmic fairness (1K99LM014428-01A1). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10984266. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
