Personalized Risk Predictions with Deep Learning Methods in the Presence of Missing and Biased Electronic Health Record Data

NIH RePORTER · NIH · R01 · $332,106

Abstract

Since 2010, clinical medicine has benefited from a rapid surge in clinical research on chronic diseases using data from electronic health records (EHRs). EHRs are appealing because they offer large sample sizes, timely information, and a wealth of clinical detail beyond what health surveys or administrative data provide. However, although large EHR databases include millions of patient records, they are not population-representative random samples, a constraint that can bias inferences drawn from such data and has therefore limited their utility for population health research. EHR data typically contain several types of bias, particularly: 1) sampling inclusion bias: EHR data include information only on patients who visit participating medical systems, and they primarily capture data when patients are ill. Even among populations with a particular disease, EHRs tend to over-represent individuals who are sicker and have higher health care utilization; 2) sampling frequency bias: the number of encounters and recorded features varies from patient to patient, and these frequencies correlate with both patient characteristics and outcomes; and 3) institution bias: the EHR sample from any hospital reflects the characteristics of the patient population served by that specific hospital. Consequently, EHR-based risk prediction models will have 1) biases in risk factor selection and estimation for population inferences; 2) disparate mistreatment (unfairness), meaning variation in a model’s prediction accuracy across patient subgroups (such as gender, race, and age) with different sampling inclusion probabilities or frequencies; and 3) predictions biased toward the characteristics of patients served by the local hospitals.
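To make the notion of disparate mistreatment concrete, the short Python sketch below computes prediction accuracy separately for each patient subgroup and reports the largest gap. It is an illustration only, not the project's fairness criterion: the function names `accuracy_by_group` and `accuracy_gap`, the subgroup labels, and the toy data are all assumptions introduced here.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for paired labels and predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


def accuracy_gap(per_group):
    """Largest difference in accuracy between any two subgroups."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical toy data: group "B" stands in for an under-sampled subgroup.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = accuracy_by_group(y_true, y_pred, groups)
gap = accuracy_gap(per_group)
print(per_group, gap)
```

In this toy example the model is correct for every patient in group A but for only half of group B, so the gap is 0.5. A fuller analysis might compare false positive and false negative rates per subgroup as well.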
We propose to develop: 1) an effective sample-weighting method to correct biases in risk factor selection and estimation for population inferences (Aim 1); 2) a flexible deep learning method for EHR personalized risk prediction with fairness criteria (Aim 2); and 3) an innovative calibration method to improve reproducibility of EHR-based risk models between institutions (Aim 3). As a demonstration of the methodology, we will predict the risk of subsequent incident cardiovascular disease (CVD) in patients with type 2 diabetes (T2DM). The methods will be generally applicable to other disease outcomes and populations of interest. To develop and validate these methods, we propose to analyze three unique datasets: 1) the New York University Langone Health EHR data (NYU-CDRN, 2009 to present), including demographics, vitals, diagnoses, lab results, prescriptions, and procedures; 2) the New York City Clinical Data Research Network (NYC-CDRN), an EHR network comprising 20 NYC healthcare institutions, including the NYU-CDRN, with longitudinally linked data on >12 million patient encounters under a Common Data Model; and 3) the Health and Retirement Study (HRS, begun in 1992...
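The abstract does not describe the Aim 1 weighting method itself. As a generic illustration of how sample weighting can offset sampling inclusion bias, the Python sketch below applies standard inverse-probability weighting to a toy sample. The function name `ipw_mean`, the data, and the inclusion probabilities are hypothetical, and the probabilities are assumed to be known here, whereas in practice they would have to be estimated.

```python
def ipw_mean(values, inclusion_probs):
    """Inverse-probability-weighted mean (weights normalized to sum to 1).

    Each observation is weighted by 1 / P(inclusion), so records from
    patients who were unlikely to appear in the sample count for more.
    """
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical sample: 1 = risk factor present, 0 = absent.
# Assumption: patients with the risk factor enter the EHR with
# probability 0.8, patients without it with probability 0.2.
values = [1, 1, 1, 1, 0]
inclusion_probs = [0.8, 0.8, 0.8, 0.8, 0.2]

naive_estimate = sum(values) / len(values)
weighted_estimate = ipw_mean(values, inclusion_probs)
print(naive_estimate, weighted_estimate)
```

The unweighted estimate of risk-factor prevalence is 0.8 because sicker patients dominate the sample. Weighting by the inverse inclusion probability pulls the estimate back to 0.5, the value implied by the assumed probabilities.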

Key facts

NIH application ID: 10463550
Project number: 5R01LM013344-02
Recipient: NEW YORK UNIVERSITY SCHOOL OF MEDICINE
Principal Investigator: Padhraic Smyth
Activity code: R01
Funding institute: NIH
Fiscal year: 2022
Award amount: $332,106
Award type: 5
Project period: 2021-08-06 → 2025-05-31