# Computational Methods for Genome Assembly, Transcript Assembly, and Gene Discovery

> **NIH R01** · JOHNS HOPKINS UNIVERSITY · 2020 · $711,987

## Abstract

Improvements in sequencing technology have spurred a tremendous increase in the use of sequencing
to answer a wide range of questions in biology and medicine. Thousands of new human genomes are
being sequenced each year in efforts to track down the genetic causes of human diseases. In parallel
with this increase in whole-genome sequencing, RNA sequencing has also exploded in popularity, due
to its power to characterize gene expression in a multitude of cell types and conditions, and to its
potential to discover new genes and new splice variants. These enormous data sets require highly
efficient and accurate computational methods for analysis, and they also present opportunities for
discovery. Furthermore, to properly analyze the many diverse humans being sequenced, we can no
longer afford to rely on a single reference genome that is missing much of the variation found in the
human population, and that makes it very difficult to analyze sequences that do not match the
reference. We propose to address these challenges in four specific ways: first, we will develop new and
improved assembly algorithms that take advantage of the latest long-read technology to create
genomes of unprecedented contiguity and completeness. This effort will include a method for creating
haplotype-resolved assemblies when sequences from both parents are available, and a method to use
an existing reference genome to create a highly contiguous assembly at minimal cost. Second, we will
apply these methods to build new human reference genomes, assembled and annotated as thoroughly
as the current human reference. These genomes, each representing a single individual, can then serve
as the basis for many future studies of the relevant populations. Third, in the area of RNA-seq
analysis, our lab has previously developed two widely used spliced aligners, TopHat and HISAT, and
two equally popular transcriptome assemblers, Cufflinks and StringTie, which now have many
thousands of users. We will extend and improve the StringTie algorithm, augmenting its novel
network flow algorithm with de novo assembly plus new alignment methods to handle long reads and
to improve its construction and quantification of transcripts. Fourth, we propose to systematically
assemble thousands of RNA-seq experiments to discover new genes and to re-build the human gene
catalog, an effort that could have a major impact on a broad array of human genetic and genomic
studies. We have recently released our first version of this effort as CHESS, a human gene catalog
built from a massive RNA-seq database that represents a comprehensive, reproducible, and open
method for annotating the human genome. The CHESS database already agrees more closely with the
two most widely used human gene databases than those two agree with each other, and we will
improve it further so that it can provide a basis for biomedical research for many years to come.
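
The closing claim about catalog agreement can be made concrete with a small sketch. The abstract does not say how agreement was measured, so the Jaccard similarity used here is an assumption, and the catalog names and transcript identifiers are hypothetical toy data rather than anything drawn from CHESS or the other databases.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity: shared items divided by all distinct items."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty catalogs are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def pairwise_agreement(catalogs):
    """Return the Jaccard similarity for every unordered pair of catalogs."""
    return {
        (x, y): jaccard(catalogs[x], catalogs[y])
        for x, y in combinations(sorted(catalogs), 2)
    }


# Hypothetical toy data: each catalog is a set of transcript identifiers.
catalogs = {
    "catalog_A": {"t1", "t2", "t3", "t4"},
    "catalog_B": {"t1", "t2", "t5", "t6"},
    "catalog_C": {"t1", "t2", "t3", "t5"},
}

scores = pairwise_agreement(catalogs)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

In this toy data, `catalog_C` shares more with each of the other two catalogs than they share with each other, which is the shape of the comparison described above. A real comparison would also have to decide what counts as the same transcript (for example, matching intron chains rather than identifiers), which this sketch leaves out.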

## Key facts

- **NIH application ID:** 9965200
- **Project number:** 2R01HG006677-19
- **Recipient organization:** JOHNS HOPKINS UNIVERSITY
- **Principal Investigator:** Steven L. Salzberg
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $711,987
- **Award type:** 2
- **Project period:** 1999-09-01 → 2025-02-28

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9965200

## Citation

> US National Institutes of Health, RePORTER application 9965200, Computational Methods for Genome Assembly, Transcript Assembly, and Gene Discovery (2R01HG006677-19). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/9965200. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*