# Data Driven Methods for Missing Data Imputation in Surgical Disparities Research

> **NIH R01** · UNIVERSITY OF PITTSBURGH AT PITTSBURGH · 2022 · $241,383

## Abstract

Disparities in health and health care have been a longstanding challenge in the United States. One specific
area of medical care in which racial/ethnic disparities have been identified is total joint arthroplasty (TJA),
particularly total knee arthroplasty (TKA) and total hip arthroplasty (THA). The large, population-based studies
needed to address health care disparities can be costly and difficult to perform, and may be compromised by
sampling strategies and patient selection biases. Efficient alternatives are publicly available, nationally
representative databases such as the HCUP State Inpatient Databases (SID) and National Inpatient Sample
(NIS). The SID provide information on all patients admitted to hospitals within participating states, allowing for
comparison of health care access among many vulnerable populations, across states, and over time. The NIS
is the largest publicly available all-payer inpatient health care database in the nation. It is sampled from the
SID through a complex survey design, yielding national estimates of health care utilization, quality, and
outcomes. A significant limitation of the NIS and the SID is the quantity of missing data. In particular, “patient
race”, a key indicator for health disparities research, has a high proportion of missing values. Multiple imputation
(MI) approaches have become increasingly popular as statistically sound methods to account for missing
data. When conducting MI, it is recommended that imputation models be as general as the data allow, in
order to accommodate a wide range of subsequent analyses of imputed data sets. This requires all
relationships that are going to be investigated in any subsequent analysis, such as nonlinearities and
interactions, to be included in the imputation model. Unfortunately, traditional MI methods, such as
multivariate imputation by chained equations (MICE), are built on parametric imputation models. These models
are often not flexible enough to capture interactions and nonlinearities in high-dimensional, large-scale data
settings. Unlike parametric models, machine learning techniques (MLTs) are model-free methods, and thus
provide flexibility for missing data imputation. MLTs use algorithms that automatically and iteratively learn from
all data to detect statistical dependencies in observations without being explicitly programmed where to look.
The goal of this study is to make the two HCUP databases a more useful resource for the study of surgical
disparities and other areas of medicine. Accordingly, we propose novel MI methods based on MLTs to impute
missing data in the SID and the NIS, and to use the imputed datasets to measure racial disparity in TKA.
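
As a minimal sketch of how analyses of multiply imputed datasets are commonly combined, the Python snippet below pools one estimate across several imputed datasets using Rubin's rules. The abstract does not say how results will be pooled, so the use of Rubin's rules here is an assumption. The function name `pool_rubin` and the numbers in the usage lines are hypothetical and are not taken from the HCUP data.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets with Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    pooled = mean(estimates)       # average of the m point estimates
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical log odds ratios from m = 3 imputed datasets (illustrative numbers only).
pooled, total = pool_rubin([-0.40, -0.35, -0.45], [0.010, 0.012, 0.011])
print(f"pooled estimate = {pooled:.3f}, standard error = {total ** 0.5:.3f}")
```

The between-imputation term grows when the imputed datasets disagree with one another, which is how uncertainty about missing values such as patient race carries through to the final standard error.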

## Key facts

- **NIH application ID:** 10771341
- **Project number:** 7R01MD013901-05
- **Recipient organization:** UNIVERSITY OF PITTSBURGH AT PITTSBURGH
- **Principal Investigator:** Yan Ma
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $241,383
- **Award type:** 7
- **Project period:** 2019-09-24 → 2024-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10771341

## Citation

> US National Institutes of Health, RePORTER application 10771341, Data Driven Methods for Missing Data Imputation in Surgical Disparities Research (7R01MD013901-05). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10771341. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*