# Integrative data science approaches for rare disease discovery in health records

> **NIH R00** · ICAHN SCHOOL OF MEDICINE AT MOUNT SINAI · 2024 · $236,394

## Abstract

There are nearly 7,000 diseases that have a prevalence of only one in 2,000 individuals or less.
Yet, such rare diseases are estimated to collectively affect over 300 million people worldwide, representing a
significant healthcare concern. Although rare diseases have predominantly genetic origins, nearly half of them
do not manifest symptoms until adulthood and frequently confound discovery and diagnosis. Even in the case
of early onset disorders, the sheer number of possible diagnoses can often overwhelm clinicians. As a result,
rare diseases are often diagnosed with delay, misdiagnosed or even remain undiagnosed, not only disrupting
patient lives but also hindering progress on our understanding of such diseases. Data science methods that
mine large-scale retrospective health record data for phenotypic information will aid in timely and accurate
diagnoses of rare diseases, especially when combined with additional data types, thus having significant real-world
impact. This proposal will integrate electronic health record (EHR) data sets with publicly available
vocabularies and ontologies, and genomic data for the improved identification and characterization of patients
with rare diseases, using approaches from machine learning, natural language processing (NLP) and basic
bioinformatics. The work has three specific aims and will be carried out in two phases. During the mentored
phase, the principal investigator (PI) will develop data-driven methods to extract standardized concepts related
to rare diseases from clinical notes and infer the occurrence of each disease (Aim 1). He will also develop data
science approaches to compare and contrast longitudinal patterns associated with patients' journeys through
the healthcare system when seeking a diagnosis for a rare disease, and aid in clinical decision-making by
leveraging these patterns (Aim 2). During the independent phase (Aim 3), computational methods will be
developed for the integrated modeling and analysis of genotypic (from Aim 3) and phenotypic information (from
Aims 1 and 2). Cohorts to be sequenced will cover diseases for which causal genes or disease definitions are
unclear (discovery), as well as those for which these are well known (validation). This work will be carried out
under the mentorship of four faculty members with complementary expertise in biomedical informatics, data
science, NLP, and rare disease genomics at the University of Washington, the largest medical system in the
Pacific Northwest (four million EHRs), world-renowned researchers in medical genetics, and a robust data
science environment. In addition, under the direction of the mentoring team, the PI will complete advanced
coursework, receive training in translational bioinformatics and clinical research informatics, submit
manuscripts, and seek an independent research position. This proposal will yield preliminary results for
subsequent studies on data-driven phenotyping and enable the realization of the ...

## Key facts

- **NIH application ID:** 10839995
- **Project number:** 5R00LM012992-05
- **Recipient organization:** ICAHN SCHOOL OF MEDICINE AT MOUNT SINAI
- **Principal Investigator:** Vikas Rao Pejaver
- **Activity code:** R00
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $236,394
- **Award type:** 5
- **Project period:** 2022-06-01 → 2026-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10839995

## Citation

> US National Institutes of Health, RePORTER application 10839995, Integrative data science approaches for rare disease discovery in health records (5R00LM012992-05). Retrieved via AI Analytics 2026-05-25 from https://api.ai-analytics.org/grant/nih/10839995. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
