# Data-driven subtyping to find patients with drug interactions leading to stroke

> **NIH F30** · COLUMBIA UNIVERSITY HEALTH SCIENCES · 2020 · $45,880

## Abstract

Stroke is a highly heterogeneous and complex disease and is a leading cause of morbidity and mortality
worldwide. Identification of the cause of disease is essential for risk stratification and optimal treatment, but can
be difficult, as up to 35% of causes are undetermined by traditional subtyping criteria and very few causative
genetic variants have been found. In addition, certain causes may be hidden within the clinical picture of a
patient, such as an adverse drug reaction. Using data-driven approaches to analyze the medical records of
patients may uncover novel patterns of risk factors and clinical features leading to stroke. The long-term goal of
this research is to identify novel subtypes of highly heterogeneous diseases such as stroke and to reduce the
genetic heterogeneity of a disease cohort by identifying patients with the same subtype. The objective of this
application is to propose a pipeline that applies a data-driven analysis of medical notes to identify novel
subtypes of stroke, focus on a subtype caused by an adverse drug reaction or drug pair interaction, validate
the subtype in a genotyped study cohort, and look for gene variant enrichment in this cohort. This application’s
central hypothesis is that applying deep learning to the electronic health record (EHR) of acute ischemic stroke
patients will form subtypes based on more granular information than currently implemented and with reduced
genetic heterogeneity by identifying novel patterns of risk factors and clinical picture leading to the stroke. In
addition, we hypothesize that at least one subtype will identify patients whose stroke is an adverse drug
reaction or drug-drug interaction. To do this, Aim 1 will first identify all acute ischemic stroke patients in the
EHR by developing a machine learning classifier trained on structured data in the EHR. Aim 2 will then build
and train an unsupervised deep learning algorithm on text from medical notes to identify clusters, or subtypes,
of patients with similar clinical pictures. Aim 3 will finally validate reduction in genetic heterogeneity of these
cohorts by estimating observational heritability of all subtypes using a tool created in our lab and comparing
this with the heritability estimates of subtypes derived from physician-based criteria. It will also focus on an
understudied subtype, stroke due to an adverse drug reaction or drug-drug interaction, by identifying its
enrichment in the novel subtypes, validating this subtype in a study cohort with genotyped data, and finally
looking for enrichment of pharmacogenetic variants in this subtype. These aims will generate a computational
pipeline that identifies novel subtypes of acute ischemic stroke, enabling improved future genetic studies by
reducing genetic heterogeneity of cohorts and improved understanding of the underlying causes of the
disease.
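
One step named in Aim 3, checking whether patients with a suspected drug-drug interaction are enriched in a given data-driven subtype, can be illustrated with a one-sided hypergeometric test. This is only a sketch: the abstract does not say which statistical test the project uses, and the function name, its arguments, and the counts in the usage note are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total, exposed_total, cluster_size, exposed_in_cluster):
    """One-sided hypergeometric tail probability P(X >= exposed_in_cluster).

    Assumed (hypothetical) setup: `total` stroke patients, of whom
    `exposed_total` took the drug pair of interest; one subtype (cluster)
    holds `cluster_size` patients, `exposed_in_cluster` of them exposed.
    """
    # The exposed count in the cluster cannot exceed either bound.
    max_k = min(cluster_size, exposed_total)
    # Number of ways to pick the cluster from the whole cohort.
    denom = comb(total, cluster_size)
    # Sum the probability of every outcome at least as extreme as observed.
    tail = sum(
        comb(exposed_total, k) * comb(total - exposed_total, cluster_size - k)
        for k in range(exposed_in_cluster, max_k + 1)
    )
    return tail / denom
```

For example, `hypergeom_enrichment_p(1000, 50, 100, 12)` gives the probability of seeing 12 or more exposed patients in a 100-patient subtype if exposure were unrelated to subtype membership. A small value would flag that subtype for closer review, after correcting for the number of subtypes tested.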

## Key facts

- **NIH application ID:** 9999028
- **Project number:** 5F30HL140946-03
- **Recipient organization:** COLUMBIA UNIVERSITY HEALTH SCIENCES
- **Principal Investigator:** Phyllis Mary Thangaraj
- **Activity code:** F30
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $45,880
- **Award type:** 5 (non-competing continuation)
- **Project period:** 2018-09-01 → 2021-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9999028

## Citation

> US National Institutes of Health, RePORTER application 9999028, Data-driven subtyping to find patients with drug interactions leading to stroke (5F30HL140946-03). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/9999028. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
