# Addressing Algorithmic Unreliability and Dataset Shift in EHR-based Risk Prediction Models

> **NIH F31** · UNIVERSITY OF PENNSYLVANIA · 2024 · $48,974

## Abstract

Project Summary
Predictive analytic algorithms built on electronic health record (EHR) inputs, such as patient characteristics,
administrative codes, and lab values, are increasingly used in health care settings to direct resources to high-
risk patients. Data play an indispensable role in the development and deployment of effective predictive models.
The greatest, yet understudied, challenge in the maintenance of these tools arises from a data-related concern,
namely dataset shift, in which the training data distribution differs from that of the population on which the algorithm is
deployed, leading to model deterioration and inaccurate risk predictions. Dataset shift is a pervasive cause of
algorithmic unreliability in EHR-based models due to inevitable changes in physician behaviors and health
system operations that alter (1) the input distribution (covariate drift) and (2) the relationship between
predictors and outcome (concept drift). Sudden changes in healthcare utilization during the COVID-19 pandemic
may have impacted the data generation process and the performance of clinical predictive models. Our
preliminary study showed that decreased collection of patient labs during the COVID-19 quarantine period led
to sparse data generation for important predictors of a single-institution EHR-based mortality risk prediction
algorithm, which in turn underpredicted risk for patients with advanced cancers. Despite the increasing use of predictive tools
in high-stakes clinical applications and growing recognition of dataset shift, we lack a framework for reasoning about
shift and its effects on care delivery, and for proactively addressing shift to maintain performance over time. In
Aim 1, we propose to extend prior work on shift to a nationally deployed risk prediction algorithm, the VA Care
Assessment Need (CAN) model, used on millions of VA beneficiaries each year. The VA CAN model predicts
the likelihood of hospitalization within 90 days or 1 year after a primary care encounter to identify high-risk
patients who would benefit from additional outpatient interventions. We also investigate covariate and concept
drift as two possible mechanisms for COVID-19-associated dataset shift. In Aim 2, we apply an interrupted time
series design to study the association between sudden shift at the onset of the pandemic and case-management
decisions. Current solutions to address dataset shift have primarily been reactive (i.e., model retraining with
recent data) and fail to be robust in new testing environments. In Aim 3, we consider revision of the VA
CAN model via machine learning and inclusion of variables that reflect potential drivers of shift. This project is
innovative as it is the first to leverage a rigorous statistical framework to study the extent and mechanisms of shift
and develop proactive guidelines for model maintenance. The training plan is rigorous for Ms. Kolla, an MD-PhD
student in biostatistics. She is strongly supported by her department and institution as ...
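
The two drift mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not the project's method: the function names, the choice of summary statistics (missingness rate, standardized mean difference, observed-to-expected ratio), and all of the toy values are assumptions made here for illustration only. The first two checks look at the inputs alone (covariate drift); the third holds the model's predictions fixed and asks whether outcomes still match them, which is consistent with, but does not prove, concept drift.

```python
import math
from statistics import mean, variance


def missing_rate(values):
    """Fraction of entries that are None (e.g., a lab that was not collected)."""
    return sum(v is None for v in values) / len(values)


def standardized_mean_difference(reference, current):
    """(mean(current) - mean(reference)) / pooled SD, ignoring missing values."""
    ref = [v for v in reference if v is not None]
    cur = [v for v in current if v is not None]
    pooled_sd = math.sqrt((variance(ref) + variance(cur)) / 2)
    return (mean(cur) - mean(ref)) / pooled_sd


def observed_to_expected(predicted_risks, outcomes):
    """Observed events divided by the sum of predicted risks.

    A ratio above 1 means the model underpredicts risk in this sample.
    """
    return sum(outcomes) / sum(predicted_risks)


# Hypothetical lab values for one predictor before and after a disruption.
pre_lab = [4.0, 3.8, 4.2, 3.6, 4.4, 4.0]
post_lab = [3.5, None, None, 3.1, None, 3.3]

# Covariate drift checks: has the input distribution changed?
missing_pre = missing_rate(pre_lab)
missing_post = missing_rate(post_lab)
smd = standardized_mean_difference(pre_lab, post_lab)

# Calibration check with predictions held fixed (hypothetical risks and outcomes).
risks = [0.1, 0.2, 0.3, 0.4]
oe_pre = observed_to_expected(risks, [0, 0, 0, 1])
oe_post = observed_to_expected(risks, [0, 1, 0, 1])

print(f"missing rate: {missing_pre:.2f} -> {missing_post:.2f}")
print(f"standardized mean difference: {smd:.2f}")
print(f"observed/expected: {oe_pre:.2f} -> {oe_post:.2f}")
```

In practice these summaries would be computed per predictor and per time window; a rise in the observed-to-expected ratio alongside stable inputs points toward a changed predictor-outcome relationship, while a rise alongside increased missingness may instead reflect the sparse-data problem the abstract describes.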

## Key facts

- **NIH application ID:** 10884196
- **Project number:** 5F31LM014282-02
- **Recipient organization:** UNIVERSITY OF PENNSYLVANIA
- **Principal Investigator:** Likhitha Kolla
- **Activity code:** F31
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $48,974
- **Award type:** 5
- **Project period:** 2023-06-01 → 2026-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10884196

## Citation

> US National Institutes of Health, RePORTER application 10884196, Addressing Algorithmic Unreliability and Dataset Shift in EHR-based Risk Prediction Models (5F31LM014282-02). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10884196. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*