Developing Machine Learning Models for the Analysis of Splicing Data in Large Heterogeneous Cohorts

NIH RePORTER · NIH · F31 · $46,036

Abstract

Analysis of RNA sequencing (RNASeq) data obtained from large patient cohorts can reveal transcriptomic perturbations that are associated with complex disease and facilitate the identification of disease subtypes. This is typically framed as an unsupervised learning task to discover latent structure in a matrix of RNASeq-based quantifications of gene expression or local splicing variations (LSVs). However, several factors make analysis of such heterogeneous data challenging. First, such datasets comprise samples processed at multiple institutions, which might employ different sequencing protocols and quality control steps. This introduces confounding factors into the data, such as inconsistent sample quality or variable cell type proportions, which can hinder detection of true biological signal. Second, in acute myeloid leukemia (AML), mutations in splice factor genes occurring in a subset of patients may alter only a subset of coregulated splicing events. Thus, instead of measuring global similarity between samples based on all transcriptomic features, there is a need to efficiently identify “tiles”, each defined by a subset of samples and splicing events with abnormal signals. Although several algorithms have been proposed for this task, they fail to overcome many of the computational challenges associated with modeling splicing data and are not well suited to handle missing values. To facilitate analysis of heterogeneous splicing datasets by reducing false positive discoveries and boosting true biological signal, we will first develop a model to correct for the effects of RNA degradation and cell type mixtures.
Then, to efficiently identify AML subtypes characterized by splicing events and to account for splicing-specific modeling challenges, we propose CHESSBOARD (Characterizing Heterogeneity of Expression and Splicing by Search for Blocks of Abnormalities and Outliers in RNA Datasets), a non-parametric Bayesian model for unsupervised discovery of tiles. We will apply our models to synthetic datasets and show that CHESSBOARD outperforms several baseline approaches. Next, we will show that it recovers tiles characterized by known and novel splicing aberrations that are reproducible in multiple AML patient cohorts. Finally, we will show that the discovered tiles are correlated with response to therapeutics, pointing to the translational impact of our findings.

Key facts

NIH application ID: 10315802
Project number: 1F31CA265218-01
Recipient: UNIVERSITY OF PENNSYLVANIA
Principal Investigator: David Wang
Activity code: F31
Funding institute: NIH
Fiscal year: 2021
Award amount: $46,036
Award type: 1
Project period: 2021-08-01 → 2024-07-31