# Predictive Modeling with High-Dimensional Incomplete Data

> **NIH R01** · RUTGERS, THE STATE UNIV OF N.J. · 2022 · $183,360

## Abstract

Predictive modeling is the cornerstone of individualized health care. The outcome of interest is most frequently the
presence or absence of a health condition, and a large number of predictors are commonly available for model
building. Both high-dimensional data and missing data pose great challenges for statistical inference
related to predictive modeling. The overarching goal of this proposal is to address methodological challenges of
predicting binary outcomes with high-dimensional incomplete data. Specifically, the PIs propose to address the
methodological challenges from the following two perspectives: (1) quantify the uncertainty of risk predictions
based on the high-dimensional logistic model; (2) accommodate two study designs in which missingness occurs in a
structured way, namely the “positive-only” study design and the two-phase design.

Recent years have seen great breakthroughs in statistical inference methods for analyzing high-dimensional data
arising from a wide spectrum of scientific fields, with a focus primarily on a single regression coefficient in
generalized linear models. Inferential methods for confidence interval construction and hypothesis testing for the
predicted probability, which is a function of all regression coefficients, are largely lacking. We develop innovative
statistical methods in this proposal to fill this methodological gap in high-dimensional data analysis. Our
proposed methods are also innovative because they accommodate structured incomplete data arising from
important sampling designs. To the best of our knowledge, statistical inference methods for high-dimensional data
analysis have to date focused exclusively on complete data arising from cross-sectional study designs. We
additionally consider two important study designs with incomplete data: the “positive-only” study
design that arises in electronic health record (EHR) phenotyping, and the two-phase design, a cost-effective sampling
design that aims to reduce the cost of measuring expensive predictors. We elucidate the methodological challenges of
accommodating missing data in downstream analysis and provide corresponding solutions.

## Key facts

- **NIH application ID:** 10495366
- **Project number:** 5R01GM140463-03
- **Recipient organization:** RUTGERS, THE STATE UNIV OF N.J.
- **Principal Investigator:** Zijian Guo
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $183,360
- **Award type:** 5
- **Project period:** 2020-09-01 → 2024-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10495366

## Citation

> US National Institutes of Health, RePORTER application 10495366, Predictive Modeling with High-Dimensional Incomplete Data (5R01GM140463-03). Retrieved via AI Analytics 2026-05-21 from https://api.ai-analytics.org/grant/nih/10495366. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*