A Modular Framework for Accurate, Efficient, and Reproducible Analysis of RNA-Seq Data

NIH RePORTER · NIH · R01 · $294,981

Abstract

We propose to develop improved, modular pipelines for more accurate and reproducible RNA-seq analyses. RNA-seq experiments are widely used in the biological and biomedical sciences to determine the expression levels of all genes and isoforms across multiple samples. Raw RNA-seq data must be pre-processed to determine the abundances of RNA molecules. State-of-the-art tools for quantifying RNA abundances are fast and efficient, model and correct for common technical biases, and provide estimates of the uncertainty of the abundances. Downstream tools for visualization and statistical testing of abundance should ideally incorporate the uncertainty of abundance estimates from the quantification step, take into account the sampling variability inherent in observations from all sequencing experiments, and estimate, for each transcript, the underlying biological variation in abundance across samples. While isolated tools fulfill a subset of these characteristics, we propose to develop a pipeline that addresses all of them while leveraging the powerful existing infrastructure for gene expression analysis. Our modular approach to improving current RNA-seq analysis pipelines will also seek to make use of the best downstream tools for gene set analysis and dynamic report generation.

Current RNA-seq computational pipelines do not keep track of critical pieces of metadata throughout the analysis, including genome and transcriptome version, so final results cannot reliably be reproduced or put in the correct genomic context, because the information about annotation provenance may be lost. While fast and lightweight tools have been quickly adopted for gene- and transcript-level quantification, they are not yet optimized for certain RNA-seq analysis tasks, such as quantification of allele-specific expression.

We have developed a set of top-performing tools for abundance quantification and downstream inference.
We propose to formalize our existing tools into a pipeline, and to build additional tools and infrastructure, that optimally estimates and propagates uncertainty from abundance estimation (described in Aim 1) and that stores critical provenance metadata automatically on the user's behalf. This metadata tagging and propagation will be integrated with community resources (described in Aim 2). Furthermore, we propose building out the capabilities of our existing quantification infrastructure to allow for improved mapping accuracy and more robust and accurate allelic expression estimation (described in Aim 3).

Key facts

NIH application ID: 10238765
Project number: 5R01HG009937-05
Recipient: UNIV OF MARYLAND, COLLEGE PARK
Principal Investigator: Michael Isaiah Love
Activity code: R01
Funding institute: NIH
Fiscal year: 2021
Award amount: $294,981
Award type: 5
Project period: 2020-03-12 → 2023-06-30