# Statistical Methods for Incorporating Machine Learning Tools in Inference and Large-Scale Surveillance using Electronic Medical Records Data

> **NIH R01** · UNIVERSITY OF WASHINGTON · 2022 · $486,320

## Abstract

The modernization and standardization of clinical care information systems are creating large networks of
linked electronic health records (EHR) that capture key treatments and select patient outcomes for
millions of patients throughout the country. The observational data emerging from these systems
provide an unparalleled opportunity to learn about the effectiveness of existing and novel treatments,
and to monitor potential safety issues that may arise when interventions are used in broad patient
populations. However, exposures in observational clinical data are driven by many factors, so
aggressive adjustment is needed to remove as much confounding bias as possible before effects can be
attributed to specific exposures. The field of machine learning provides a powerful
collection of data-driven approaches for performing flexible, thorough confounding adjustment, but
performing reliable statistical inference is particularly challenging when these techniques are used as
part of the analytic strategy. We propose to advance reproducible research methods by developing and
illustrating novel targeted learning tools that leverage the flexibility of machine learning methods to
detect and characterize health effect signals using large-scale EHR data.

Specifically, we will first develop techniques for making efficient, statistically valid and robust inference
for treatment effects using state-of-the-art machine learning tools. We will also develop online learning
techniques to make such inference in the context of streaming EHR data. Methodological advances will
enable us to formulate a formal, rigorous and practical framework for conducting continuous, effective
and reliable surveillance for safety endpoints. Finally, we will develop statistical approaches for
incorporating prior information (for example, demographic, epidemiologic or pharmacodynamic
knowledge) to improve health effect estimation and inference when the health outcome
of interest is rare and the statistical problem is thus difficult, as often occurs in safety surveillance.

The ultimate goal of the proposed research is to enable biomedical researchers and public health
regulators to carefully monitor and protect the health of the public by allowing them to more effectively
and more reliably detect critical health effect signals that may be contained in population-scale EHR
data.

## Key facts

- **NIH application ID:** 10463566
- **Project number:** 5R01HL137808-04
- **Recipient organization:** UNIVERSITY OF WASHINGTON
- **Principal Investigator:** Marco Carone
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $486,320
- **Award type:** 5 (non-competing continuation)
- **Project period:** 2019-07-18 → 2024-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10463566

## Citation

> US National Institutes of Health, RePORTER application 10463566, Statistical Methods for Incorporating Machine Learning Tools in Inference and Large-Scale Surveillance using Electronic Medical Records Data (5R01HL137808-04). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10463566. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
