# Adjusting for selection bias due to missing data in electronic health records-based research

> **NIH F31** · HARVARD UNIVERSITY D/B/A HARVARD SCHOOL OF PUBLIC HEALTH · 2020 · $22,135

## Abstract

The adoption of electronic health records (EHR) in routine healthcare has resulted in a hugely promising
source of data for public health and medical research. Because EHR include rich data on large populations
at relatively low cost, many researchers have turned to observational studies using EHR as an alternative to
conducting randomized studies that are often prohibitively expensive and time-consuming to perform. However,
EHR data are not collected for research purposes, and the potential for selection bias is high when analyses are
restricted to patients with complete data. Standard methods to adjust for selection bias due to missing data, such
as inverse probability weighting (IPW) and multiple imputation (MI), fail to address the complex nature of EHR
data. Specifically, these methods tend to oversimplify the interplay of numerous decisions by patients, physicians,
and insurers that collectively determine whether complete data are observed.
 One method for addressing selection bias due to missing data involves breaking down the complex process
that governs whether or not a patient has complete data into a series of more manageable sub-mechanisms. This
method involves characterizing the data provenance, or the process by which data appear in EHR. Statistical
models can then be built for selection at each sub-mechanism to better reflect the true data provenance. A
framework for estimation has been developed in which IPW is used to adjust for selection at every sub-mechanism.
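
As a rough illustration of the modular idea (not the project's actual estimator), suppose the probability of having complete data factors into a product of conditional probabilities, one per sub-mechanism. The inverse-probability weight for a complete case is then the reciprocal of that product. The three stages and all numbers below are hypothetical, and the per-stage probabilities are assumed to have been estimated already by separate models.

```python
from math import prod


def modular_ipw_weight(stage_probs):
    """Weight = 1 / product of per-stage selection probabilities.

    stage_probs holds one estimated conditional probability per
    sub-mechanism (hypothetical stages here), each in (0, 1].
    """
    if not stage_probs or any(not 0.0 < p <= 1.0 for p in stage_probs):
        raise ValueError("each stage probability must lie in (0, 1]")
    return 1.0 / prod(stage_probs)


def weighted_mean(values, weights):
    """Normalized (Hajek-style) weighted mean over the complete cases."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Hypothetical complete cases: (outcome, [P(stage 1), P(stage 2 | 1), P(stage 3 | 1, 2)])
complete_cases = [
    (1.0, [0.9, 0.8, 0.5]),
    (0.0, [0.9, 0.5, 0.5]),
    (1.0, [1.0, 0.8, 0.25]),
]

weights = [modular_ipw_weight(probs) for _, probs in complete_cases]
estimate = weighted_mean([y for y, _ in complete_cases], weights)
```

In this sketch each stage could be modeled with its own covariates, which is the practical appeal of splitting one opaque selection process into smaller pieces.
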
 Since MI is generally more efficient than IPW, strategies for 'blended analyses' will be developed that
simultaneously implement IPW and MI under the modularized specification. Estimation and inferential procedures under
this framework will be established, and extensions to Rubin's rules for the variance of estimators that combine
results across multiply imputed datasets in this framework will be derived.
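
For reference, the standard form of Rubin's rules (the starting point that the abstract proposes to extend, not the extension itself) pools m point estimates and their variances as follows. This is a minimal sketch; the function name and example numbers are illustrative only.

```python
from statistics import mean, variance


def rubin_pool(estimates, variances):
    """Pool results from m imputed datasets with standard Rubin's rules.

    Returns (pooled estimate, total variance), where
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance, divisor m - 1
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Illustrative numbers only
pooled, total_var = rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries the extra uncertainty due to missing data; any variance formula for a blended IPW and MI analysis would also need to account for the estimated weights, which this sketch does not do.
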
 IPW and MI fail to produce consistent estimates when data are missing not at random (MNAR); that is, when
the probability that some covariate or outcome is measured depends on the value of that variable itself, or on other
factors that are not completely measured in the EHR. Methods for sensitivity analyses will be developed to assess
the extent to which estimators yielded by these methods are impacted by such unobserved data.
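
One common style of sensitivity analysis, shown here only as a hedged sketch and not as the method this project develops, is to tilt each estimated selection probability by an unidentifiable parameter and see how the estimate moves. Below, `gamma` (a hypothetical sensitivity parameter) shifts the log-odds of being observed by `gamma * y`, so `gamma = 0` corresponds to no dependence on the outcome itself. All data are made up.

```python
from math import exp, log


def tilt_probability(p, gamma, y):
    """Shift logit(p) by gamma * y; requires 0 < p < 1."""
    logit = log(p / (1.0 - p)) + gamma * y
    return 1.0 / (1.0 + exp(-logit))


def sensitivity_curve(outcomes, probs, gammas):
    """Weighted mean of the outcome under each assumed gamma."""
    results = {}
    for g in gammas:
        weights = [1.0 / tilt_probability(p, g, y)
                   for y, p in zip(outcomes, probs)]
        total = sum(w * y for w, y in zip(weights, outcomes))
        results[g] = total / sum(weights)
    return results


# Hypothetical complete cases and their estimated selection probabilities
outcomes = [1.0, 0.0, 1.0, 0.0]
probs = [0.5, 0.5, 0.8, 0.4]
curve = sensitivity_curve(outcomes, probs, [-1.0, 0.0, 1.0])
```

Reporting the estimate over a plausible range of `gamma`, rather than at a single value, shows how strongly the conclusions depend on the untestable missingness assumption.
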
 The methods described in these aims will be applied to EHR-derived data that include long-term health outcomes
among 13,000 individuals with type 2 diabetes who underwent bariatric surgery between 1997 and 2013.
Specifically, this research will answer open questions about the efficacy and safety of bariatric surgery in the
treatment of patients with obesity and type 2 diabetes, and will consider how rates of micro- and macrovascular
complications associated with diabetes differ between patients undergoing alternative surgical procedures.
Robust software will be developed that provides researchers valid, practical, and user-friendly tools for the
i...

## Key facts

- **NIH application ID:** 9742288
- **Project number:** 5F31DK118817-02
- **Recipient organization:** HARVARD UNIVERSITY D/B/A HARVARD SCHOOL OF PUBLIC HEALTH
- **Principal Investigator:** Tanayott Thaweethai
- **Activity code:** F31
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $22,135
- **Award type:** 5
- **Project period:** 2018-08-01 → 2020-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9742288

## Citation

> US National Institutes of Health, RePORTER application 9742288, Adjusting for selection bias due to missing data in electronic health records-based research (5F31DK118817-02). Retrieved via AI Analytics 2026-05-28 from https://api.ai-analytics.org/grant/nih/9742288. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*