# Developing Machine Learning Models for the Analysis of Splicing Data in Large Heterogeneous Cohorts

> **NIH F31** · UNIVERSITY OF PENNSYLVANIA · 2021 · $46,036

## Abstract

Analysis of RNA sequencing (RNASeq) data obtained from large patient cohorts can reveal transcriptomic perturbations that are associated with complex disease and facilitate the identification of disease subtypes. This is typically framed as an unsupervised learning task: discovering latent structure in a matrix of RNASeq-based quantifications of gene expression or local splicing variations (LSVs). However, several factors make analysis of such heterogeneous data challenging. First, such datasets comprise samples processed at multiple institutions, which may employ different sequencing protocols and quality control steps. This introduces confounding factors, such as inconsistent sample quality or variable cell type proportions, which can hinder detection of true biological signal. Second, in acute myeloid leukemia (AML), mutations in splice factor genes that occur in a subset of patients may alter only a subset of coregulated splicing events. Thus, instead of measuring global similarity between samples based on all transcriptomic features, there is a need to efficiently identify “tiles”, each defined by a subset of samples and a subset of splicing events with abnormal signals. Although several algorithms have been proposed for this task, they fail to overcome many of the computational challenges associated with modeling splicing data and are not well suited to handling missing values.

To facilitate analysis of heterogeneous splicing datasets by reducing false positive discoveries and boosting true biological signal, we will first develop a model to correct for the effects of RNA degradation and cell type mixtures. Then, to efficiently identify AML subtypes characterized by splicing events and to account for splicing-specific modeling challenges, we propose CHESSBOARD (Characterizing Heterogeneity of Expression and Splicing by Search for Blocks of Abnormalities and Outliers in RNA Datasets), a non-parametric Bayesian model for unsupervised discovery of tiles. We will apply our models to synthetic datasets and show that they outperform several baseline approaches. Next, we will show that CHESSBOARD recovers tiles characterized by known and novel splicing aberrations that are reproducible in multiple AML patient cohorts. Finally, we will show that the discovered tiles correlate with response to therapeutics, pointing to the translational impact of our findings.
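
To make the notion of a tile concrete, the following is a minimal illustrative sketch and not the CHESSBOARD model, whose details are not given in this abstract. It plants a block of shifted values in a synthetic samples-by-events matrix with missing entries, then compares a simple robust outlier score inside and outside that block. All function names, matrix sizes, the shift magnitude, and the missing-value rate are assumptions chosen only for illustration.

```python
import numpy as np


def make_synthetic_psi(n_samples=60, n_events=40, n_tile_samples=15,
                       n_tile_events=10, shift=0.4, missing_rate=0.1, seed=0):
    """Hypothetical generator: a samples x events matrix of inclusion-like
    values in [0, 1] with one planted tile and randomly missing entries."""
    rng = np.random.default_rng(seed)
    psi = np.clip(rng.normal(0.5, 0.05, size=(n_samples, n_events)), 0.0, 1.0)

    # The tile is a subset of samples (rows) crossed with a subset of events (columns).
    rows = np.arange(n_tile_samples)
    cols = np.arange(n_tile_events)
    block = np.ix_(rows, cols)
    psi[block] = np.clip(psi[block] + shift, 0.0, 1.0)

    # Mark a random fraction of entries as missing (NaN), as is common for splicing data.
    psi[rng.random(psi.shape) < missing_rate] = np.nan
    return psi, rows, cols


def robust_z(psi):
    """Per-event robust z-scores from the median and MAD, ignoring NaNs."""
    med = np.nanmedian(psi, axis=0)
    mad = np.nanmedian(np.abs(psi - med), axis=0)
    return (psi - med) / (1.4826 * mad + 1e-9)


def tile_score(z, rows, cols):
    """Mean absolute robust z-score over the observed entries of a candidate tile."""
    return np.nanmean(np.abs(z[np.ix_(rows, cols)]))


psi, rows, cols = make_synthetic_psi()
z = robust_z(psi)

# Score the planted tile against a block of untouched samples and events.
other_rows = np.setdiff1d(np.arange(psi.shape[0]), rows)
other_cols = np.setdiff1d(np.arange(psi.shape[1]), cols)
planted = tile_score(z, rows, cols)
background = tile_score(z, other_rows, other_cols)
print(f"planted tile score: {planted:.2f}, background score: {background:.2f}")
```

In this toy setting the planted block stands out because its entries deviate jointly from each event's cohort-wide median. A real method would also have to search over candidate sample and event subsets rather than being handed the tile, which is the part the proposal addresses.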

## Key facts

- **NIH application ID:** 10315802
- **Project number:** 1F31CA265218-01
- **Recipient organization:** UNIVERSITY OF PENNSYLVANIA
- **Principal Investigator:** David Wang
- **Activity code:** F31
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $46,036
- **Award type:** 1
- **Project period:** 2021-08-01 → 2024-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10315802

## Citation

> US National Institutes of Health, RePORTER application 10315802, Developing Machine Learning Models for the Analysis of Splicing Data in Large Heterogeneous Cohorts (1F31CA265218-01). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10315802. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*