Robust identification and accurate quantification of RNA transcripts on a system-wide scale

NIH RePORTER · NIH · R01 · $334,584

Abstract

Next-generation Illumina RNA sequencing (RNA-seq) is by far the most widely used assay for investigating animal transcriptomes, and numerous public RNA-seq data sets have been generated for various biological conditions in multiple species. However, several barriers remain in using short RNA-seq reads to accurately identify the splicing structures and quantify the abundances of full-length RNA transcripts. In this proposal, we will develop a series of novel statistical and computational methods to improve the robustness of transcript identification and the accuracy of transcript quantification from Illumina RNA-seq data. (Aim 1) We will develop a novel screening method to construct transcript candidates by first detecting sparse splicing structures from multiple RNA-seq data sets for a given biological condition. These transcript candidates will significantly reduce the search space of downstream transcript identification methods and hence improve their precision. (Aim 2) We will develop a robust transcript identification method to identify novel transcripts in a conservative manner from RNA-seq data given existing annotations. Our method will be based on statistical model selection under the Neyman-Pearson paradigm, which will allow users to control the false positive rate of the identified novel transcripts under any given threshold with high probability. (Aim 3) We will develop an accurate transcript quantification method that effectively leverages multiple RNA-seq data sets and simultaneously assesses data quality based on low-throughput gold standards and cross-data similarities. All of these methods will first be used to study transcripts in mouse macrophages, for which gold-standard qPCR and full-length cDNA sequences will be generated for training and method validation. The methods will then be tested more broadly in other biological systems where suitable gold-standard data are available.
Our methods and software will significantly facilitate the use of Illumina RNA-seq data for gene expression studies at the transcript level, increase reproducibility of scientific discoveries from transcriptomic studies, and improve our understanding of gene expression mechanisms in various biological conditions.

Key facts

NIH application ID: 9974525
Project number: 5R01GM120507-05
Recipient: UNIVERSITY OF CALIFORNIA LOS ANGELES
Principal Investigator: Jingyi Jessica Li
Activity code: R01
Funding institute: NIH
Fiscal year: 2020
Award amount: $334,584
Award type: 5
Project period: 2016-09-01 → 2022-05-31