# Personalized Risk Predictions with Deep Learning Methods in the Presence of Missing and Biased Electronic Health Record Data

> **NIH R01** · NEW YORK UNIVERSITY SCHOOL OF MEDICINE · 2022 · $332,106

## Abstract

Since 2010, clinical medicine has benefited from a rapid surge of clinical research on chronic diseases using
data from electronic health records (EHRs). EHRs are appealing because they can offer large sample sizes,
timely information, and a wealth of clinical information beyond that obtained from either health surveys or
administrative data. However, although large EHR databases include millions of patient records, they are not
population-representative random samples, a constraint that potentially biases inferences based on such data
and has therefore limited their utility for population health research. EHR data typically contain multiple types
of bias, particularly: 1) sampling inclusion bias: EHR data only include information on patients visiting
participating medical systems, and they primarily capture data when patients are ill. Even among populations
with a particular disease, EHRs tend to over-represent individuals who are sicker and
have higher health care utilization; 2) sampling frequency bias: the numbers of encounters and recorded
features vary across patients, and these frequencies correlate with both patients’ characteristics
and outcomes; and 3) institution bias: the EHR sample of any hospital reflects the characteristics of the patient
population served by that specific hospital. Consequently, EHR-based risk prediction models will have 1)
biases in risk factor selection and estimation for population inferences; 2) disparate mistreatment (unfairness)
in terms of variation in a model’s prediction accuracy across patient subgroups (such as gender, race, and age)
with various sampling inclusion probabilities or frequencies; and 3) predictions biased toward the characteristics
of patients served by the local hospitals. We propose to develop: 1) an effective sample-weighting method to
correct biases in risk factor selection and estimation for population inferences (Aim 1); 2) a flexible deep learning
method for EHR personalized risk prediction with fairness criteria (Aim 2); and 3) an innovative calibration method
to improve reproducibility of EHR-based risk models between institutions (Aim 3). We will predict risk of
subsequent incident cardiovascular disease (CVD) in patients with type 2 diabetes (T2DM) as a demonstration
of methodology development. The methods will be generally applicable to other disease
outcomes and populations of interest. To develop and validate these methods, we propose to analyze three
unique datasets: 1) the New York University Langone Health EHR data (NYU-CDRN, 2009 to present), including
demographics, vitals, diagnoses, lab results, prescriptions, and procedures; 2) the New York City Clinical Data
Research Network (NYC-CDRN)—an EHR network comprising 20 NYC healthcare institutions, including the
NYU-CDRN, with longitudinally linked data on >12 million patient encounters under a Common Data Model;
and 3) the Health and Retirement Study (HRS, begun in 1992...
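
The abstract does not say which sample-weighting method Aim 1 will use. As a hedged illustration of the general idea only, the sketch below applies standard inverse-probability weighting to a toy sample with sampling inclusion bias. The `Record` type, the function names, the toy numbers, and the assumption that each patient's inclusion probability is known are all hypothetical and not taken from the grant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    outcome: int           # 1 if the patient had the event, else 0
    inclusion_prob: float  # ASSUMED known probability of appearing in the EHR sample


def naive_rate(records):
    """Unweighted event rate in the EHR sample."""
    return sum(r.outcome for r in records) / len(records)


def ipw_rate(records):
    """Normalized inverse-probability-weighted event rate.

    Each record is weighted by 1 / inclusion_prob, so under-sampled
    patients count for more when estimating the population rate.
    """
    weights = [1.0 / r.inclusion_prob for r in records]
    weighted_events = sum(w * r.outcome for w, r in zip(weights, records))
    return weighted_events / sum(weights)


# Hypothetical population of 1000 patients:
#   200 high-utilization patients, event rate 0.5, inclusion probability 0.8
#   800 low-utilization patients,  event rate 0.1, inclusion probability 0.2
# True population event rate: (200 * 0.5 + 800 * 0.1) / 1000 = 0.18.
# The expected EHR sample therefore holds 160 patients from each group.
sample = (
    [Record(1, 0.8)] * 80 + [Record(0, 0.8)] * 80
    + [Record(1, 0.2)] * 16 + [Record(0, 0.2)] * 144
)

print(f"naive: {naive_rate(sample):.2f}")  # naive: 0.30 (inflated by sicker patients)
print(f"ipw:   {ipw_rate(sample):.2f}")    # ipw:   0.18 (matches the population rate)
```

In practice the inclusion probabilities are not observed and would have to be estimated, for example against a population-representative reference survey. That estimation step is outside this sketch.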

## Key facts

- **NIH application ID:** 10463550
- **Project number:** 5R01LM013344-02
- **Recipient organization:** NEW YORK UNIVERSITY SCHOOL OF MEDICINE
- **Principal Investigator:** Padhraic Smyth
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $332,106
- **Award type:** 5
- **Project period:** 2021-08-06 → 2025-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10463550

## Citation

> US National Institutes of Health, RePORTER application 10463550, Personalized Risk Predictions with Deep Learning Methods in the Presence of Missing and Biased Electronic Health Record Data (5R01LM013344-02). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10463550. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*