# Statistical Methods for Analyzing Electronic Health Record Data

> **NIH R01** · UNIVERSITY OF PENNSYLVANIA · 2020 · $396,834

## Abstract

The overarching goal of this proposal is to develop innovative statistical methods for Electronic
Health Record (EHR) based research. Clinically relevant information from the EHR permits the
derivation of a rich collection of phenotypes. Unfortunately, since the data is primarily collected
for clinical rather than research purposes, the true status of any given individual with respect to
the trait of interest is not necessarily known. A common study design is to use structured clinical
data elements to identify case and control groups on which subsequent analyses are based. To
minimize identification error, a common practice is that separate, but complementary, rules are
developed to select individuals for the case and control groups, with case selection rules
emphasizing a high positive predictive value (PPV), and control selection rules emphasizing a
high negative predictive value (NPV). The accuracy of control identification is usually high as
the sheer number of available controls permits overly restrictive definitions constructed to ensure
a high NPV. In contrast, contamination by subjects who are not cases plagues case selection, as
the need to have adequate sample size, and by extension power, must be balanced against the
probability that those selected truly have the trait of interest. We call these non-cases “ineligibles”
because they do not satisfy the definition of controls. Ignoring inaccuracy in case identification
by treating ineligibles as true cases can lead to biased analysis. No statistical methods yet exist
for addressing the bias resulting from this unique challenge of case contamination in EHR-based
case-control studies. In particular, statistical methods for the classical misclassification problem
where labels for cases and controls are switched are not applicable. The current standard practice
limits analysis to a further selected subset with high PPV, which may yield practically ignorable
bias but is not efficient. This proposal aims to fill this gap by developing efficient statistical
methods when “gold standard” case versus non-case status is available from medical chart
review for a validation subset of candidate cases. Our methods, accompanied by comprehensive
and user-friendly software, will offer researchers a rich arsenal of statistical methods and tools
for analyzing EHR data.
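
The bias from case contamination described above can be illustrated with simple arithmetic. The sketch below is not the proposal's method; it only shows, under stated assumptions, how an exposure odds ratio shifts when a fraction `1 - ppv` of the apparent case group consists of ineligibles. The function names and all numeric values (exposure prevalences of 0.5, 0.2, and 0.2, and a PPV of 0.8) are hypothetical and chosen purely for illustration.

```python
def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1.0 - p)


def observed_odds_ratio(p_case: float, p_control: float,
                        p_ineligible: float, ppv: float) -> float:
    """Exposure odds ratio when the apparent case group is a mixture.

    Assumption (illustrative only): a fraction `ppv` of the apparent cases
    are true cases with exposure prevalence `p_case`; the remaining
    fraction are ineligibles with exposure prevalence `p_ineligible`.
    """
    p_mix = ppv * p_case + (1.0 - ppv) * p_ineligible
    return odds(p_mix) / odds(p_control)


# Hypothetical values, not taken from any study.
true_or = observed_odds_ratio(0.5, 0.2, 0.2, ppv=1.0)          # no contamination
contaminated_or = observed_odds_ratio(0.5, 0.2, 0.2, ppv=0.8)  # 20% ineligibles

print(f"odds ratio with pure cases:         {true_or:.3f}")
print(f"odds ratio with contaminated cases: {contaminated_or:.3f}")
```

In this hypothetical setting, where ineligibles resemble controls in exposure, contamination pulls the odds ratio toward the null. With other exposure patterns among ineligibles the bias could go in either direction.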

## Key facts

- **NIH application ID:** 9969193
- **Project number:** 5R01HL138306-03
- **Recipient organization:** UNIVERSITY OF PENNSYLVANIA
- **Principal Investigator:** Jinbo Chen
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $396,834
- **Award type:** 5
- **Project period:** 2018-09-01 → 2023-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9969193

## Citation

> US National Institutes of Health, RePORTER application 9969193, Statistical Methods for Analyzing Electronic Health Record Data (5R01HL138306-03). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/9969193. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*