A Modular Framework for Accurate, Efficient, and Reproducible Analysis of RNA-Seq Data

NIH RePORTER · NIH · R01 · $294,981

Abstract

We propose to develop improved, modular pipelines for more accurate and reproducible RNA-seq analyses. RNA-seq experiments are widely used in the biological and biomedical sciences to determine the expression levels of all genes and isoforms across multiple samples. Raw RNA-seq data must be pre-processed to determine the abundances of RNA molecules. State-of-the-art tools for quantifying RNA abundances are fast and efficient, model and correct for common technical biases, and provide estimates of the uncertainty of the abundances. Downstream tools for visualization and statistical testing of abundance should ideally incorporate the uncertainty of abundance estimates from the quantification step, take into account the sampling variability inherent in the observations of all sequencing experiments, and estimate, for each transcript, the underlying biological variation in abundance across samples.

While isolated tools fulfill a subset of the above characteristics, we propose to develop a pipeline that addresses all of them while leveraging the powerful existing infrastructure for gene expression analysis. Our modular approach to improving current RNA-seq analysis pipelines will also seek to make use of the best downstream tools for gene set analysis and dynamic report generation.

Current RNA-seq computational pipelines do not keep track of critical pieces of metadata throughout the analysis, including genome and transcriptome version, so final results cannot reliably be reproduced or placed in the correct genomic context, because information about annotation provenance may be lost. While fast and lightweight tools have been quickly adopted for gene- and transcript-level quantification, they are not yet optimized for certain RNA-seq analysis tasks, such as quantification of allele-specific expression. We have developed a set of top-performing tools for abundance quantification and downstream inference.

We propose to formalize our existing tools into a pipeline, and to build additional tools and infrastructure, that optimally estimates and propagates uncertainty from abundance estimation (described in Aim 1) and that stores critical provenance metadata automatically on the user's behalf. This metadata tagging and propagation will be integrated with community resources (described in Aim 2). Furthermore, we propose building out the capabilities of our existing quantification infrastructure to allow for improved mapping accuracy and more robust and accurate allelic expression estimation (described in Aim 3).

Key facts

NIH application ID: 10238765
Project number: 5R01HG009937-05
Recipient: UNIV OF MARYLAND, COLLEGE PARK
Principal Investigator: Michael Isaiah Love
Activity code: R01
Funding institute: NIH
Fiscal year: 2021
Award amount: $294,981
Award type: 5
Project period: 2020-03-12 → 2023-06-30