Computational Methods for Sequence Alignment, Genotyping, and Diploid Genome Assembly

NIH RePORTER · NIH · R01 · $410,093

Abstract

Massive sequencing is revolutionizing biological research and clinical practice. Over the past decades, projects such as the 1000 Genomes Project, TCGA, GTEx, and GEUVADIS have generated hundreds of trillions of reads. The recent completion of the UK’s 100K WGS project has inspired many other nations to develop their own 100K WGS projects. Improvements in sequencing throughput and reductions in cost have enabled more thorough and deeper studies of cancer, genetic disorders, and other areas of human biology. Advanced sequence alignment and computational methodologies have played a major role in conducting these analyses. In recent years, our lab has contributed to these unprecedented global-scale endeavors by developing several widely used bioinformatics tools for analyzing NGS reads: TopHat2 and HISAT for aligning RNA-seq reads, TopHat-Fusion for identifying gene fusions, Centrifuge for classifying metagenomic sequencing reads, HISAT2 for graph alignment at the human genome scale, and HISAT-genotype for HLA gene typing and assembly. This proposal addresses several key challenges in the areas of sequence alignment, genotyping, and diploid genome assembly. First, we plan to research and develop various indexing strategies. Virtually all alignment programs rely on one type of index for aligning reads to a reference. Alignment accuracy and speed will be further enhanced by incorporating additional types of indexes. Second, we will develop genotyping and diploid genome assembly algorithms. As sequencing costs continue to decline, it will become routine for people to have their own genomes sequenced for clinical purposes. We will further develop our initial version of HISAT-genotype into a comprehensive suite of tools that can genotype and assemble a person’s whole diploid genome in one day on a desktop. Third, we will continue to maintain and improve HISAT2 and develop a new, more versatile aligner.
We propose to unify widely used alignment programs by implementing their common functions (input processing, indexing, aligning, and reporting) as modules and providing application programming interfaces (APIs) that expose those modules. Bioinformatics engineers can then use the APIs to develop their own indexes and alignment algorithms, customized to best analyze their own data sets. We plan to demonstrate the usability of the new sequence aligner, SARTOR (Sequence Alignment Repertoire To Optimize Reference-guided analysis), by effectively handling different types of reads (WGS, WES, RNA-seq, ChIP-seq, BS-seq, etc.) produced by different sequencing technologies (short, long, and linked reads). Upon successful completion, the proposed software systems will promote personalized medicine by drawing upon customized personal genomes, with key functionalities including differential gene expression analysis and somatic mutation identification. The programs will also allow rese...

Key facts

NIH application ID: 10242898
Project number: 5R01GM135341-03
Recipient: UT SOUTHWESTERN MEDICAL CENTER
Principal Investigator: Daehwan Kim
Activity code: R01
Funding institute: NIH
Fiscal year: 2021
Award amount: $410,093
Award type: 5
Project period: 2019-09-23 → 2023-08-31