Integrating genomic and clinical data to predict disease phenotypes using heterogeneous ensembles

NIH RePORTER · NIH · R01 · $539,951

Abstract

PROJECT SUMMARY

Genomic and other “omic” profiles hold immense potential for advancing personalized/precision medicine by enabling the accurate prediction of disease phenotypes or outcomes for individual patients, which a clinician can use to design an appropriate plan of care. Despite this potential, the actual impact of these omic profiles on disease phenotype prediction may be limited because even the large cohorts that collect these data do not cover enough individuals. In contrast, a variety of clinical data types, such as laboratory tests and physician notes, are routinely collected and studied for a much larger number of patients undergoing treatment for such diseases at medical centers. The abundance of these clinical data, and their complementarity with multi-omic data, offer an opportunity to advance personalized medicine by integrating these disparate types of data. However, the disparity in data formats, with several omic profiles being structured and several clinical data types, such as physician notes, being unstructured, poses challenges for this integration. An associated challenge is that different classes of computational methods are likely to be the most effective for predicting disease phenotypes from these clinical and omics datasets. These challenges limit the ability of current data integration methods to address this problem. Here, we propose an innovative approach to this integration by assimilating diverse base phenotype predictors inferred from individual clinical and omics datasets into heterogeneous ensembles. These ensembles, which have shown promise for several other computational genomics problems, can aggregate an unrestricted number and variety of base predictors, which is ideal for this integration problem.
Specifically, we describe how existing heterogeneous ensemble methods for single datasets can be transformed and advanced to address the multiple clinical and omic dataset integration problem. In particular, we detail novel algorithms for improving these integrative ensembles by modeling and incorporating the inherent patient and dataset heterogeneity in these datasets. We also propose novel algorithms for leveraging the inherent complementarity among clinical and omic datasets, as well as an innovative approach for handling expected missing data, both with the goal of making ensemble phenotype predictors more accurate and applicable to patient cohorts. To assess the performance of this novel suite of data integration-oriented heterogeneous ensembles, we will validate their effectiveness for predicting asthma and Inflammatory Bowel Disease phenotypes in substantial patient cohorts with diverse omics and clinical datasets. We will publicly release efficient software implementations of the methods developed in this project to enable others to carry out similar analyses with other diverse data collections. Successful accomplishment of the proposed work wi...
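The general idea of a heterogeneous ensemble over several datasets can be illustrated with a minimal Python sketch. This is not the project's method or software. It assumes two hypothetical data modalities (`omics`, `clinical`), two toy base learners of different types (`CentroidScorer`, `KnnScorer`), and plain mean aggregation of their scores. Averaging only over the modalities a patient actually has is a simple stand-in for missing-data handling, not the approach proposed in the abstract. All names and values are invented for illustration.

```python
from math import dist
from statistics import mean


class CentroidScorer:
    """Toy base learner: relative closeness to the positive-class centroid."""

    def fit(self, X, y):
        pos = [x for x, label in zip(X, y) if label == 1]
        neg = [x for x, label in zip(X, y) if label == 0]
        self.pos = [mean(col) for col in zip(*pos)]
        self.neg = [mean(col) for col in zip(*neg)]
        return self

    def score(self, x):
        d_pos, d_neg = dist(x, self.pos), dist(x, self.neg)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total


class KnnScorer:
    """Toy base learner: fraction of positive labels among the k nearest samples."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.data = list(zip(X, y))
        return self

    def score(self, x):
        nearest = sorted(self.data, key=lambda pair: dist(x, pair[0]))[: self.k]
        return mean(label for _, label in nearest)


def fit_ensemble(train, labels, learners):
    """Fit every learner type on every modality (one base predictor per pair)."""
    return {
        (modality, name): make().fit(X, labels)
        for modality, X in train.items()
        for name, make in learners.items()
    }


def predict(ensemble, patient):
    """Mean-aggregate base predictor scores, skipping modalities that are None."""
    scores = [
        model.score(patient[modality])
        for (modality, _), model in ensemble.items()
        if patient.get(modality) is not None
    ]
    if not scores:
        raise ValueError("patient has no available modality")
    return mean(scores)


# Hypothetical toy cohort: four patients, same order in both modalities.
train = {
    "omics": [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]],
    "clinical": [[1.0], [1.2], [3.0], [3.1]],
}
labels = [0, 0, 1, 1]
learners = {"centroid": CentroidScorer, "knn": KnnScorer}

ensemble = fit_ensemble(train, labels, learners)
high = predict(ensemble, {"omics": [0.85, 0.85], "clinical": [3.05]})
low = predict(ensemble, {"omics": [0.15, 0.15], "clinical": None})  # clinical data missing
print(len(ensemble), high > 0.5, low < 0.5)
```

Mean aggregation is only the simplest way to combine base predictors. A learned combiner such as stacking would replace `predict` with a second-level model trained on held-out base predictor scores.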

Key facts

NIH application ID: 10218766
Project number: 1R01HG011407-01A1
Recipient: ICAHN SCHOOL OF MEDICINE AT MOUNT SINAI
Principal Investigator: Gaurav Pandey
Activity code: R01
Funding institute: NIH
Fiscal year: 2021
Award amount: $539,951
Award type: 1
Project period: 2021-06-01 → 2025-03-31