Advanced computational methods in analyzing high-throughput sequencing data

NIH RePORTER · NIH · R01 · $397,125 · view on reporter.nih.gov ↗

Abstract

Sequencing technologies have become an essential tool to the study of human evolution, to the understanding of the genetic bases of diseases and to the clinical detection and treatment of genetic disorders. Computational algorithms are indispensible to the analysis of large-scale sequencing data and have received broad attention. However, developed several years ago, many mainstream software packages for sequence alignment, assembly and variant calling have gradually lagged behind the rapid development of sequencing technologies. They are unable to process the latest long reads or assembled contigs, and will be outpaced by upcoming technologies in terms of throughput. The development of advanced algorithms is critical to the applications of sequencing technologies in the near future. This project will address this pressing need with four proposals: (1) developing a fast and accurate aligner that accelerates short-read alignment and can map megabase-long assemblies against large sequence collections of over 100 gigabases in size; (2) developing an integrated caller for small sequence variations that is faster to run, more sensitive to moderately longer insertions and more accessible to biologists without extended expertise in bioinformatics; (3) developing a generic variant filtering tool that uses a novel deep learning model to achieve human-level accuracy on identifying false positive calls; (4) developing a new de novo assembler that works with the latest nanopore reads of ~100 kilobases in length and may achieve good contiguity at low coverage. Upon completion, the proposed studies will dramatically reduce the computational cost of data processing in most research labs and commercial entities, and will enable the applications of long reads in genome assembly, in the study of structural variations and in cancer researches.

Key facts

NIH application ID
10113656
Project number
5R01HG010040-05
Recipient
DANA-FARBER CANCER INST
Principal Investigator
Heng Li
Activity code
R01
Funding institute
NIH
Fiscal year
2021
Award amount
$397,125
Award type
5
Project period
2018-11-16 → 2022-02-28