# Integrating genomic and clinical data to predict disease phenotypes using heterogeneous ensembles

> **NIH R01** · ICAHN SCHOOL OF MEDICINE AT MOUNT SINAI · 2022 · $535,451

## Abstract

PROJECT SUMMARY
Genomic and other “omic” profiles hold immense potential for advancing personalized/precision medicine by
enabling the accurate prediction of disease phenotypes or outcomes for individual patients, which can be used
by a clinician to design an appropriate plan of care. However, despite this potential, the actual impact of these
omic profiles on disease phenotype prediction may be limited by the fact that even large cohorts collecting
these data do not cover large enough numbers of individuals. In contrast, a variety of clinical data types, such
as laboratory tests and physician notes, are routinely collected and studied for a much larger number of
patients undergoing treatment for such diseases at medical centers. The abundance of these clinical data, and
their complementarity with multi-omic data, offer an opportunity to advance personalized medicine by
integrating these disparate types of data. However, the disparity in data formats, with several omic profiles
being structured and several clinical data types, such as physician notes, being unstructured, poses
challenges for this integration. An associated challenge due to this disparity is that different classes of
computational methods are likely to be the most effective for predicting disease phenotypes from these clinical
and omics datasets. These challenges pose barriers for current data integration methods to address this
problem. Here, we propose an innovative approach to this integration by assimilating diverse base phenotype
predictors inferred from individual clinical and omics datasets into heterogeneous ensembles. These
ensembles, which have shown promise for several other computational genomics problems, can aggregate an
unrestricted number and variety of base predictors, which is ideal for this integration problem. Specifically, we
describe how existing heterogeneous ensemble methods for single datasets can be transformed and advanced
to address the multiple clinical and omic dataset integration problem. In particular, we detail novel algorithms
for improving these integrative ensembles by modeling and incorporating the inherent patient and dataset
heterogeneity in these datasets. We also propose novel algorithms for leveraging the inherent complementarity
among clinical and omic datasets, as well as an innovative approach for handling expected missing data, both
with the goal of making ensemble phenotype predictors more accurate and applicable to patient cohorts. To
assess the performance of this novel suite of data integration-oriented heterogeneous ensembles, we will
validate their effectiveness for predicting asthma and Inflammatory Bowel Disease phenotypes in substantial
patient cohorts with diverse omics and clinical datasets. We will publicly release efficient software
implementations of the methods developed in this project to enable others to carry out similar analyses with
other diverse data collections. Successful accomplishment of the proposed work wi...
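
As a rough illustration of the late-integration idea described in the abstract, the sketch below averages the outputs of base predictors trained on separate data modalities and skips any modality that is missing for a patient. It is not the project's method: plain averaging stands in for the proposed ensemble algorithms, and every name here (`threshold_predictor`, `ensemble_score`, the `"omics"` and `"clinical"` keys, the feature values and cutoffs) is a hypothetical choice made for this example.

```python
from statistics import fmean
from typing import Callable, Dict, List, Optional, Sequence

# A base predictor maps one modality's feature vector to a score in [0, 1].
BasePredictor = Callable[[Sequence[float]], float]


def threshold_predictor(index: int, cutoff: float) -> BasePredictor:
    """Hypothetical toy base predictor: 1.0 if features[index] > cutoff, else 0.0."""
    def predict(features: Sequence[float]) -> float:
        return 1.0 if features[index] > cutoff else 0.0
    return predict


def ensemble_score(
    patient: Dict[str, Optional[Sequence[float]]],
    predictors: Dict[str, Sequence[BasePredictor]],
) -> Optional[float]:
    """Average the scores of all base predictors whose modality is available.

    Modalities that are absent or None for this patient are skipped, so the
    ensemble still returns a score from whatever data exists. Returns None
    when no modality is available at all.
    """
    scores: List[float] = []
    for modality, models in predictors.items():
        features = patient.get(modality)
        if features is None:
            continue  # missing modality: contribute nothing
        for predict in models:
            scores.append(predict(features))
    return fmean(scores) if scores else None


# Assumed setup: two omics-based predictors and one clinical predictor.
predictors = {
    "omics": [threshold_predictor(0, 0.5), threshold_predictor(1, 2.0)],
    "clinical": [threshold_predictor(0, 100.0)],
}

complete_patient = {"omics": [0.9, 1.0], "clinical": [150.0]}
clinical_only_patient = {"omics": None, "clinical": [150.0]}
no_data_patient: Dict[str, Optional[Sequence[float]]] = {}

print(ensemble_score(complete_patient, predictors))
print(ensemble_score(clinical_only_patient, predictors))
print(ensemble_score(no_data_patient, predictors))
```

Because the aggregation step only sees scores, base predictors of any kind can be mixed freely, which is the property the abstract highlights. A learned combiner such as stacking could replace `fmean` without changing the interface.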

## Key facts

- **NIH application ID:** 10409755
- **Project number:** 5R01HG011407-02
- **Recipient organization:** ICAHN SCHOOL OF MEDICINE AT MOUNT SINAI
- **Principal Investigator:** Gaurav Pandey
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $535,451
- **Award type:** 5
- **Project period:** 2021-06-01 → 2025-03-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10409755

## Citation

> US National Institutes of Health, RePORTER application 10409755, Integrating genomic and clinical data to predict disease phenotypes using heterogeneous ensembles (5R01HG011407-02). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10409755. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*