# Developing scalable algorithms to incorporate unstructured electronic health records for causal inference based on real-world data

> **NIH R01** · BRIGHAM AND WOMEN'S HOSPITAL · 2024 · $537,652

## Abstract

The routine operation of the US healthcare system produces an abundance of electronically stored data that
captures the care of patients as it is provided in settings outside of controlled research environments. The
potential for utilizing these data to inform future treatment choices and improve patient care and outcomes of all
patients in the very system that generates the data is widely acknowledged. Given these key properties of the
routine-care data and the abundance of electronic healthcare databases covering millions of patients, it is critical
to strengthen the rigor of analyses of such data. Our group has previously developed an analytic approach to
reduce bias when analyzing routine-care databases, which has proven effective in more than 50 empirical
research studies across a range of topics and data sources. However, this approach currently cannot incorporate
free-text information that is recorded in electronic health records, such as clinical notes and reports. This
limitation has left a large amount of rich patient information underutilized for clinical research. We thus aim to
adapt and refine a set of established computerized natural language processing algorithms that can identify and
extract useful information from the clinical notes and reports in electronic health records and incorporate them
into our validated analytical approach for balancing background risks of different comparison groups, a key step
to ensure fair evaluation when comparing different therapeutic options. To test this newly integrated and
augmented approach, we will implement and adapt it in simulation studies where we can evaluate and improve
the performance of these new analytic methods in a controlled but realistic fashion. In addition, we will assess
the performance of our new approach in 8 practical studies comparing medical or surgical treatments that are
highly relevant to patients. To ensure the highest level of data completeness and quality, we have linked multiple
healthcare utilization (claims) databases, spanning from 2007 to 2016, with 3 electronic health records systems,
including one each in Massachusetts, North Carolina, and Texas. These data will allow testing of our newly
integrated approach in a variety of care delivery systems and data environments, which will be very informative
for the application of our products in real-world settings.
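
As a rough illustration of the idea described above (turning free-text notes into covariates and checking how well they are balanced between comparison groups), the following minimal sketch may help. It is not the project's algorithm: the keyword vocabulary, the toy notes, and the function names `term_indicators`, `standardized_mean_difference`, and `term_balance` are all hypothetical, and simple keyword matching stands in for the natural language processing methods the abstract refers to.

```python
import math
import re
from collections import defaultdict


def term_indicators(note, vocabulary):
    """Return the vocabulary terms that appear in a free-text note (keyword match only)."""
    tokens = set(re.findall(r"[a-z]+", note.lower()))
    return tokens & vocabulary


def standardized_mean_difference(p_treated, p_comparator):
    """Standardized difference of a binary covariate, given its prevalence in each group."""
    diff = p_treated - p_comparator
    pooled_var = (p_treated * (1 - p_treated) + p_comparator * (1 - p_comparator)) / 2
    if pooled_var == 0:
        # Both prevalences are 0 or 1: perfectly balanced or perfectly separated.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled_var)


def term_balance(records, vocabulary):
    """Compute the standardized difference of each note-derived term.

    records: iterable of (treated: bool, note: str) pairs.
    Returns a dict mapping each vocabulary term to its standardized difference.
    """
    counts = {True: defaultdict(int), False: defaultdict(int)}
    totals = {True: 0, False: 0}
    for treated, note in records:
        totals[treated] += 1
        for term in term_indicators(note, vocabulary):
            counts[treated][term] += 1
    if totals[True] == 0 or totals[False] == 0:
        raise ValueError("both comparison groups need at least one record")
    return {
        term: standardized_mean_difference(
            counts[True][term] / totals[True],
            counts[False][term] / totals[False],
        )
        for term in sorted(vocabulary)
    }


# Hypothetical toy data: (received treatment A?, clinical note text).
records = [
    (True, "Patient is frail, uses walker."),
    (True, "Frail elderly patient with dementia."),
    (False, "Active patient, no complaints."),
    (False, "Patient is frail but independent."),
]
vocabulary = {"frail", "dementia", "walker"}

balance = term_balance(records, vocabulary)
for term, smd in balance.items():
    print(f"{term}: standardized difference = {smd:.2f}")
```

Terms with a large absolute standardized difference are unevenly recorded across the groups, which is the kind of imbalance an adjustment step would aim to reduce.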

## Key facts

- **NIH application ID:** 10808997
- **Project number:** 5R01LM013204-05
- **Recipient organization:** BRIGHAM AND WOMEN'S HOSPITAL
- **Principal Investigator:** JOSHUA K LIN
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $537,652
- **Award type:** 5 (non-competing continuation)
- **Project period:** 2020-06-01 → 2025-12-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10808997

## Citation

> US National Institutes of Health, RePORTER application 10808997, Developing scalable algorithms to incorporate unstructured electronic health records for causal inference based on real-world data (5R01LM013204-05). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10808997. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
