# Efficient Statistical Learning Methods for Personalized Medicine Using Large Scale Biomedical Data

> **NIH R01** · UNIV OF NORTH CAROLINA CHAPEL HILL · 2020 · $328,853

## Abstract

 Current medical treatment guidelines largely rely on data from randomized controlled trials that study 
average effects, which may be inadequate for making individualized decisions for real-world patients. Large-scale
electronic health records (EHRs) data provide unprecedented opportunities to optimize personalized treatment
strategies and generate evidence relevant to real-world patients. However, there are inherent challenges in the
use of EHRs, including the non-experimental nature of data collection processes, heterogeneous data types with
complex dependencies, irregular measurement patterns, multiple dynamic treatment sequences, and the need
to balance risk and benefit of treatments. Using two high-quality EHR databases, Columbia University Medical
Center's clinical data warehouse and the Indiana Network for Patient Care database, and focusing on type 2
diabetes (T2D), this proposal will develop novel and scalable statistical learning approaches that overcome these
challenges to discover optimal personalized treatment strategies for T2D from real-world patients. Specifically,
under Aim 1, we will develop a unified framework to learn latent temporal processes for feature extraction and
dynamic patient records representation. Our approach will accommodate large-scale variables of mixed types
(continuous, binary, counts) measured at irregular intervals. It will extract lower-dimensional components to reflect
patients' dynamic health status, account for informative healthcare documentation processes, and characterize
similarities between patients. Under Aim 2, we will develop fast and efficient multi-category machine learning
methods to evaluate treatment propensities and adaptively learn optimal dynamic treatment regimens
(DTRs) among the extensive number of treatment options observed in the EHRs. The methods will provide 
sequential decisions that determine the best treatment sequence for a T2D patient given his/her EHRs. Under Aim
3, we will develop statistical learning methods to assist multi-faceted treatment decision-making, which balances
risks versus benefits when evaluating a DTR. Our approach will maximize benefit to the greatest extent
while keeping all risk outcomes within the safety margins. For all aims, we will develop efficient stochastic
resampling algorithms to scale up the optimization for massive data sizes. We will identify optimal DTRs for T2D
using the extracted information from patients' comorbidity conditions, medications, and laboratory tests, as well
as records-collection processes. Our methodologies will be applied and cross-validated between the two EHR
databases. The treatment strategies learned from the representative EHR databases with a diverse patient 
population will be beneficial for individual patient care, helping clinicians adaptively choose the optimal treatment
for a patient. Finally, we will disseminate our methods and results through freely available softwar...

## Key facts

- **NIH application ID:** 9891071
- **Project number:** 5R01GM124104-03
- **Recipient organization:** UNIV OF NORTH CAROLINA CHAPEL HILL
- **Principal Investigator:** Yuanjia Wang
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $328,853
- **Award type:** 5
- **Project period:** 2018-04-01 → 2022-03-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9891071

## Citation

> US National Institutes of Health, RePORTER application 9891071, Efficient Statistical Learning Methods for Personalized Medicine Using Large Scale Biomedical Data (5R01GM124104-03). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/9891071. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*