# A Modular Framework for Accurate, Efficient, and Reproducible Analysis of RNA-Seq Data

> **NIH R01** · UNIV OF MARYLAND, COLLEGE PARK · 2022 · $294,981

## Abstract

We propose to develop improved, modular pipelines for more accurate and reproducible RNA-seq analyses. RNA-seq experiments are widely used in biological and biomedical sciences to determine the expression level of all genes and isoforms across multiple samples. Raw RNA-seq data must be pre-processed to determine abundances of RNA molecules. State-of-the-art tools for quantifying RNA abundances are fast and efficient, model and correct for common technical biases, and provide estimates of the uncertainty of the abundances. Downstream tools for visualization and statistical testing of abundance ideally should incorporate uncertainty of abundance estimates from the quantification step, take into account the sampling variability inherent in observations in all sequencing experiments, and estimate, for each transcript, the underlying biological variation in abundances across samples. While isolated tools fulfill a subset of the above characteristics, we propose to develop a pipeline which addresses all of these, while at the same time leveraging the powerful existing infrastructure for gene expression analysis. Our modular approach to improving the current RNA-seq analysis pipelines will also seek to make use of the best downstream tools for gene set analysis and dynamic report generation. Current RNA-seq computational pipelines do not keep track of critical pieces of metadata throughout the analysis, including genome and transcriptome version, such that final results cannot reliably be reproduced or put in the correct genomic context, as the information about annotation provenance may be lost. While fast and lightweight tools have been quickly adopted for gene- and transcript-level quantification, they are not yet optimized for certain RNA-seq analysis tasks such as quantification of allele-specific expression. We have developed a set of top-performing tools for abundance quantification and downstream inference. We propose to formalize our existing tools into a pipeline, and build additional tools and infrastructure, which optimally estimates and propagates uncertainty from abundance estimation (described in Aim 1), and which stores critical provenance metadata automatically on the user's behalf; this metadata tagging and propagation will be integrated with community resources (described in Aim 2). Furthermore, we propose building out the capabilities of our existing quantification infrastructure to allow for improved mapping accuracy and more robust and accurate allelic expression estimation (described in Aim 3).

## Key facts

- **NIH application ID:** 10440402
- **Project number:** 5R01HG009937-06
- **Recipient organization:** UNIV OF MARYLAND, COLLEGE PARK
- **Principal Investigator:** Michael Isaiah Love
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $294,981
- **Award type:** 5
- **Project period:** 2020-03-12 → 2024-06-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10440402

## Citation

> US National Institutes of Health, RePORTER application 10440402, A Modular Framework for Accurate, Efficient, and Reproducible Analysis of RNA-Seq Data (5R01HG009937-06). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10440402. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
