# Using the literature to build causal models of retrospective observational data

> **NIH K99** · UNIVERSITY OF PITTSBURGH AT PITTSBURGH · 2021 · $76,101

## Abstract

Health data contain a wealth of information for research. Health data, such as those found in electronic health
records (EHRs), allow for the identification of links between health events, such as drug exposures and
side-effects. Some of these links indicate stable dependencies deemed to be causes. Causal insight allows
reverse-engineering of disease. If confounding is not addressed, it will be difficult to distinguish causative from correlative
links. Our approach is to identify confounders explicitly. Graphical causal models (GCMs) can discover
causal links from data and prior knowledge. GCMs summarize causal links between variables. Automated
selection of variables would allow GCMs to scale and yield more insight from data. Literature-based discovery
(LBD) methods were developed to identify links between concepts in the literature. Advanced methods permit
the search for concepts linked to each other through specific verbs, e.g., “causes”, “treats”. Our hypothesis is
that we can exploit structured knowledge extracted from the literature to inform GCMs. In prior work, we found
that LBD + GCM was better at identifying side-effects in EHR data than traditional methods. We hypothesize
that, compared to methods that use data alone, our method will increase the ability to detect causal
relationships from EHR data. The first aim is to determine the extent to which LBD-informed GCM improves the
identification of causal links for drug safety. We will build LBD-informed GCMs using publicly available
reference datasets for drug safety. These reference datasets contain drug/side-effect pairs for performance
benchmarking. (A) Test the ability of GCM algorithms to identify known causal links solely using data. We will
systematically evaluate GCM algorithms based on their ability to re-discover causal links in a reference
standard. Results will guide our studies on how GCM can be tuned. (B) Determine the effect of adding different
subsets of LBD-derived information to GCMs on the identification of drug side-effects. We will build causal models using
increasing numbers of confounders. The second aim is to test the ability of LBD built with disease-specific
literature to improve the relevance of LBD-derived confounders for Alzheimer's Disease (AD). We chose AD for
its high prevalence and relative lack of effective pharmacologic treatment. (A) Compare LBD strategies in a
disease-specific setting. We will test LBD variants that use disease-specific literature against LBD lacking
subject-matter restrictions. (B) Define the ability of robust LBD-informed GCM to validate drug repurposing candidates
for treating AD symptoms. We will test the ability of advanced methods to iteratively resolve latent
confounding, when detected, to improve effect estimates. The fulfillment of these aims will yield new methods
to combine insights from the literature with causal modeling to uncover causal relationships of drug exposures
on adverse events and on beneficial outcomes.
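
The abstract does not say how LBD-derived confounders are selected, so the following is only a minimal sketch under stated assumptions. It assumes the literature has already been reduced to (subject, predicate, object) triples, and it treats any concept asserted to cause both the exposure and the outcome as a candidate confounder. The function name `candidate_confounders`, the predicate label `CAUSES`, and all concept names in the toy data are hypothetical placeholders, not taken from the project.

```python
from collections import defaultdict


def candidate_confounders(predications, exposure, outcome, predicates=("CAUSES",)):
    """Return concepts asserted to cause both the exposure and the outcome.

    `predications` is an iterable of (subject, predicate, object) triples.
    This is an illustrative heuristic, not the project's actual method.
    """
    # Map each concept to the set of concepts asserted to cause it.
    causes_of = defaultdict(set)
    for subj, pred, obj in predications:
        if pred in predicates:
            causes_of[obj].add(subj)

    # A common cause of exposure and outcome is a candidate confounder.
    shared = causes_of[exposure] & causes_of[outcome]
    return sorted(shared - {exposure, outcome})


# Toy triples, invented purely for illustration.
TOY_PREDICATIONS = [
    ("condition_a", "CAUSES", "drug_x"),     # an indication leading to exposure
    ("condition_a", "CAUSES", "outcome_y"),  # the same condition drives the outcome
    ("condition_b", "CAUSES", "outcome_y"),  # causes the outcome only
    ("condition_c", "CAUSES", "drug_x"),     # causes the exposure only
    ("drug_x", "TREATS", "condition_c"),     # ignored: predicate not selected
]

confounders = candidate_confounders(TOY_PREDICATIONS, "drug_x", "outcome_y")
print(confounders)
```

In this toy setting only `condition_a` is linked to both the exposure and the outcome, so it is the only candidate returned. A causal model built on EHR data could then include such candidates as adjustment variables, which is the role the abstract assigns to LBD.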

## Key facts

- **NIH application ID:** 10125247
- **Project number:** 1K99LM013367-01A1
- **Recipient organization:** UNIVERSITY OF PITTSBURGH AT PITTSBURGH
- **Principal Investigator:** Scott Alexander Malec
- **Activity code:** K99
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $76,101
- **Award type:** 1
- **Project period:** 2021-08-01 → 2023-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10125247

## Citation

> US National Institutes of Health, RePORTER application 10125247, Using the literature to build causal models of retrospective observational data (1K99LM013367-01A1). Retrieved via AI Analytics 2026-05-27 from https://api.ai-analytics.org/grant/nih/10125247. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*