Computational Methods for Genome Assembly, Transcript Assembly, and Gene Discovery

NIH RePORTER · NIH · R01 · $713,435

Abstract

Project Summary

Improvements in sequencing technology have spurred a tremendous increase in the use of sequencing to answer a wide range of questions in biology and medicine. Thousands of new human genomes are being sequenced each year in efforts to track down the genetic causes of human diseases. In parallel with this increase in whole-genome sequencing, RNA sequencing has also exploded in popularity, due to its power to characterize gene expression in a multitude of cell types and conditions, and to its potential to discover new genes and new splice variants. These enormous data sets require highly efficient and accurate computational methods for analysis, and they also present opportunities for discovery. Furthermore, to properly analyze the many diverse humans being sequenced, we can no longer afford to rely on a single reference genome that is missing much of the variation found in the human population, and that makes it very difficult to analyze sequences that do not match the reference. We propose to address these challenges in four specific ways. First, we will develop new and improved assembly algorithms that take advantage of the latest long-read technology to create genomes of unprecedented contiguity and completeness. This effort will include a method for creating haplotype-resolved assemblies when sequences from both parents are available, and a method to use an existing reference genome to create a highly contiguous assembly at minimal cost. Second, we will apply these methods to build new human reference genomes, assembled and annotated as thoroughly as the current human reference. These genomes, each representing a single individual, can then serve as the basis for many future studies of the relevant populations. Third, in the area of RNA-seq analysis, our lab has previously developed two widely used spliced aligners, TopHat and HISAT, and two equally popular transcriptome assemblers, Cufflinks and StringTie, which now have many thousands of users. We will extend and improve the StringTie algorithm, augmenting its novel network flow algorithm with de novo assembly plus new alignment methods to handle long reads and to improve its construction and quantification of transcripts. Fourth, we propose to systematically assemble thousands of RNA-seq experiments to discover new genes and to rebuild the human gene catalog, an effort that could have a major impact on a broad array of human genetic and genomic studies. We have recently released our first version of this effort as CHESS, a human gene catalog built from a massive RNA-seq database that represents a comprehensive, reproducible, and open method for annotating the human genome. The CHESS database already agrees more closely with each of the two most widely used human gene databases than those two databases agree with each other, and we will improve it further so that it can provide a basis for biomedical research for many years to come.

Key facts

NIH application ID: 10147905
Project number: 5R01HG006677-20
Recipient: JOHNS HOPKINS UNIVERSITY
Principal Investigator: Steven L. Salzberg
Activity code: R01
Funding institute: NIH
Fiscal year: 2021
Award amount: $713,435
Award type: 5
Project period: 1999-09-01 → 2025-02-28