Statistical Methods for Incorporating Machine Learning Tools in Inference and Large-Scale Surveillance using Electronic Medical Records Data

NIH RePORTER · NIH · R01 · $485,987

Abstract

The modernization and standardization of clinical care information systems are creating large networks of linked electronic health records (EHR) that capture key treatments and select patient outcomes for millions of patients throughout the country. The observational data emerging from these systems provide an unparalleled opportunity to learn about the effectiveness of existing and novel treatments, and to monitor potential safety issues that may arise when interventions are used in broad patient populations. However, exposures in observational clinical data are driven by many factors, so aggressive adjustment is needed to remove as much confounding bias as possible before effects can be attributed to select exposures. The field of machine learning provides a powerful collection of data-driven approaches for performing flexible, thorough confounding adjustment, but performing reliable statistical inference is particularly challenging when these techniques are used as part of the analytic strategy. We propose to advance reproducible research methods by developing and illustrating novel targeted learning tools that leverage the flexibility of machine learning methods to detect and characterize health effect signals using large-scale EHR data. Specifically, we will first develop techniques for making efficient, statistically valid and robust inference for treatment effects using state-of-the-art machine learning tools. We will also develop online learning techniques to make such inference in the context of streaming EHR data. These methodological advances will enable us to formulate a formal, rigorous and practical framework for conducting continuous, effective and reliable surveillance for safety endpoints.
Finally, we will develop statistical approaches for incorporating prior information, such as demographic, epidemiologic or pharmacodynamic knowledge, to improve health effect estimation and inference when the health outcome of interest is rare and the statistical problem is therefore difficult, as often occurs in safety surveillance. The ultimate goal of the proposed research is to enable biomedical researchers and public health regulators to carefully monitor and protect the health of the public by allowing them to detect, more effectively and more reliably, critical health effect signals that may be contained in population-scale EHR data.

Key facts

NIH application ID: 9979940
Project number: 5R01HL137808-02
Recipient: UNIVERSITY OF WASHINGTON
Principal Investigator: Marco Carone
Activity code: R01
Funding institute: NIH
Fiscal year: 2020
Award amount: $485,987
Award type: 5
Project period: 2019-07-18 → 2024-06-30