Using the literature to build causal models of retrospective observational data

NIH RePORTER · NIH · K99 · $68,663

Abstract

Health data contain a wealth of information for research. Health data, such as those found in electronic health records (EHRs), allow the identification of links between health events, such as drug exposures and side effects. Some of these links indicate stable dependencies that can be regarded as causal. Causal insight allows reverse-engineering of disease. If confounding is not addressed, it is difficult to distinguish causative from merely correlative links. Our approach is to identify confounders explicitly. Graphical causal models (GCMs) can discover causal links from data and prior knowledge, and they summarize the causal links between variables. Automated selection of variables would allow GCMs to scale and yield more insight from data. Literature-based discovery (LBD) methods were developed to identify links between concepts in the literature. Advanced methods permit searching for concepts linked to each other through specific verbs, e.g., “causes”, “treats”. Our hypothesis is that structured knowledge extracted from the literature can be exploited to inform GCMs. In prior work, we found that LBD + GCM was better at identifying side effects in EHR data than traditional methods. We hypothesize that, compared with methods that rely solely on data, our method will increase the ability to detect causal relationships in EHR data.

The first aim is to determine the extent to which LBD-informed GCM improves the identification of causal links for drug safety. We will build LBD-informed GCMs using publicly available reference datasets for drug safety. These reference datasets contain drug/side-effect pairs for performance benchmarking. (A) Test the ability of GCM algorithms to identify known causal links using data alone. We will systematically evaluate GCM algorithms on their ability to rediscover causal links in a reference standard. The results will guide our studies on how GCM can be tuned. (B) Determine the effect of adding different subsets of LBD-derived information to GCMs on the identification of drug side effects. We will build causal models using increasing numbers of confounders.

The second aim is to test whether LBD built with disease-specific literature improves the relevance of LBD-derived confounders for Alzheimer's Disease (AD). We chose AD for its high prevalence and relative lack of effective pharmacologic treatment. (A) Compare LBD strategies in a disease-specific setting. We will compare LBD variants restricted to disease-specific literature with LBD lacking subject-matter restrictions. (B) Define the ability of robust LBD-informed GCM to validate drug repurposing candidates for treating AD symptoms. We will test the ability of advanced methods to iteratively resolve latent confounding, when detected, to improve effect estimates.

The fulfillment of these aims will yield new methods that combine insights from the literature with causal modeling to uncover causal effects of drug exposures on adverse events and on beneficial outcomes.
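The role of explicit confounder identification can be illustrated with a minimal sketch. The abstract does not specify how effects are estimated, so the following uses simple stratified standardization as a stand-in, and every number in it is hypothetical: the counts, the single binary confounder, and the function names are assumptions made only for illustration, not part of the project.

```python
from typing import Dict, Tuple

# Hypothetical counts, invented for illustration only:
# (confounder_level, exposed) -> (number of patients, number with the adverse event)
Counts = Dict[Tuple[int, int], Tuple[int, int]]

COUNTS: Counts = {
    (0, 1): (100, 10),   # confounder absent, drug exposed
    (0, 0): (400, 40),   # confounder absent, unexposed
    (1, 1): (400, 200),  # confounder present, drug exposed
    (1, 0): (100, 50),   # confounder present, unexposed
}


def crude_risk_difference(counts: Counts) -> float:
    """Risk difference (exposed minus unexposed) ignoring the confounder."""
    totals = {0: [0, 0], 1: [0, 0]}  # exposed flag -> [patients, events]
    for (_, exposed), (n, events) in counts.items():
        totals[exposed][0] += n
        totals[exposed][1] += events
    risk_exposed = totals[1][1] / totals[1][0]
    risk_unexposed = totals[0][1] / totals[0][0]
    return risk_exposed - risk_unexposed


def standardized_risk_difference(counts: Counts) -> float:
    """Risk difference averaged over confounder strata,
    weighting each stratum by its share of all patients."""
    strata = sorted({z for z, _ in counts})
    total_n = sum(n for n, _ in counts.values())
    rd = 0.0
    for z in strata:
        n1, e1 = counts[(z, 1)]
        n0, e0 = counts[(z, 0)]
        weight = (n1 + n0) / total_n
        rd += weight * (e1 / n1 - e0 / n0)
    return rd


if __name__ == "__main__":
    print(f"crude risk difference:        {crude_risk_difference(COUNTS):.2f}")
    print(f"standardized risk difference: {standardized_risk_difference(COUNTS):.2f}")
```

In this constructed table the drug has no effect within either stratum, yet the crude comparison shows a risk difference of 0.24 because exposed patients more often carry the confounder. Adjusting for the confounder brings the estimate to zero, which is the kind of spurious link that explicit confounder selection is meant to remove.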

Key facts

NIH application ID: 10444995
Project number: 5K99LM013367-02
Recipient: UNIVERSITY OF PITTSBURGH AT PITTSBURGH
Principal Investigator: Scott Alexander Malec
Activity code: K99
Funding institute: NIH
Fiscal year: 2022
Award amount: $68,663
Award type: 5
Project period: 2021-08-01 → 2023-07-31