# Computational Methods for Sequence Alignment, Genotyping, and Diploid Genome Assembly

> **NIH R01** · UT SOUTHWESTERN MEDICAL CENTER · 2021 · $410,093

## Abstract

Massive sequencing is revolutionizing biological research and clinical practice. Over the past decades, projects
such as the 1000 Genomes Project, TCGA, GTEx, and GEUVADIS have generated hundreds of trillions of
reads. The recent completion of the UK’s 100K WGS project has inspired many other nations to develop their
own 100K WGS projects. The improvements in throughput and reduced costs of sequencing have enabled more
thorough and deeper studies of cancer, genetic disorders, and other areas of human biology. Advanced
sequence alignment and computational methodologies have played a major role in conducting these
analyses. In recent years, our lab has contributed to these unprecedented, global-scale endeavors by
developing several widely used bioinformatics tools for analyzing NGS reads: TopHat2 and
HISAT for aligning RNA-seq reads, TopHat-Fusion for identifying gene fusions, Centrifuge for classifying
metagenomic sequencing reads, HISAT2 for graph alignment at the human genome scale, and
HISAT-genotype for HLA gene typing and assembly.
 This proposal addresses several key challenges in the areas of sequence alignment, genotyping, and
diploid genome assembly. First, we plan to research and develop various indexing strategies. Virtually all
alignment programs rely on one type of index for aligning reads to a reference. Alignment accuracy and speed
will be further enhanced by incorporating additional types of indexes. Second, we will develop genotyping and
diploid genome assembly algorithms. As sequencing costs continue to decline, it will become routine for people
to have their own genomes sequenced for clinical purposes. We will further develop our initial version of
HISAT-genotype into a comprehensive suite of tools that can genotype and assemble a person’s whole diploid
genome in one day on a desktop. Third, we will continue to maintain and improve HISAT2 and develop a new,
more versatile aligner. We propose to unify widely used alignment programs by developing several common
functions of alignment programs (input processing, indexing, aligning, and reporting) as modules and provide
application programming interfaces (APIs) that expose those modules, so that bioinformatics engineers can
use the APIs to develop their own indexes and alignment algorithms, customized to best analyze
their own data sets. We plan to demonstrate the usability of the new sequence aligner, SARTOR (Sequence
Alignment Repertoire To Optimize Reference-guided analysis), by effectively handling different types of reads
(WGS, WES, RNA-seq, ChIP-seq, BS-seq, etc.) produced by different sequencing technologies (short, long, and
linked reads). Upon successful completion, the proposed software systems will promote personalized medicine
by drawing upon customized personal genomes, with key functionalities including differential gene expression
analysis and somatic mutation identification. The programs will also allow rese...
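
The modular design described in the abstract (input processing, indexing, aligning, and reporting as separate, swappable modules) might be sketched as follows. This is only an illustration under stated assumptions: every name here (`Hit`, `Index`, `KmerIndex`, `parse_reads`, `align`, `report`, `run_pipeline`) is hypothetical and is not the SARTOR or HISAT2 API, and the simple k-mer hash index with Hamming-distance verification stands in for the far more sophisticated indexes those tools use.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Protocol, Tuple


class Hit(NamedTuple):
    read_name: str
    position: int    # 0-based reference offset, -1 if unaligned
    mismatches: int


class Index(Protocol):
    """Hypothetical interface: any index exposing k and lookup() can be plugged in."""
    k: int

    def lookup(self, seed: str) -> List[int]: ...


class KmerIndex:
    """Indexing module: maps every k-mer of the reference to its offsets."""

    def __init__(self, reference: str, k: int = 4) -> None:
        self.k = k
        self.table: Dict[str, List[int]] = defaultdict(list)
        for i in range(len(reference) - k + 1):
            self.table[reference[i:i + k]].append(i)

    def lookup(self, seed: str) -> List[int]:
        return self.table.get(seed, [])


def parse_reads(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Input module: yield (name, sequence) from FASTA-like, single-line records."""
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
        elif line and name is not None:
            yield name, line.upper()
            name = None


def align(name: str, seq: str, reference: str, index: Index,
          max_mismatches: int = 1) -> Hit:
    """Aligning module: seed with the read's first k-mer, verify by Hamming distance."""
    best = None
    for pos in index.lookup(seq[:index.k]):
        window = reference[pos:pos + len(seq)]
        if len(window) < len(seq):
            continue  # read would run off the end of the reference
        mm = sum(a != b for a, b in zip(seq, window))
        if mm <= max_mismatches and (best is None or mm < best.mismatches):
            best = Hit(name, pos, mm)
    return best if best is not None else Hit(name, -1, 0)


def report(hits: Iterable[Hit]) -> List[str]:
    """Reporting module: one tab-separated line per read."""
    return [f"{h.read_name}\t{h.position}\t{h.mismatches}" for h in hits]


def run_pipeline(read_lines: Iterable[str], reference: str, k: int = 4) -> List[str]:
    index = KmerIndex(reference, k)
    return report(align(n, s, reference, index) for n, s in parse_reads(read_lines))
```

Because `align` depends only on the `Index` interface, a different index type could in principle be substituted without touching the input or reporting code, which is the kind of customization the abstract describes.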

## Key facts

- **NIH application ID:** 10242898
- **Project number:** 5R01GM135341-03
- **Recipient organization:** UT SOUTHWESTERN MEDICAL CENTER
- **Principal Investigator:** Daehwan Kim
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $410,093
- **Award type:** 5
- **Project period:** 2019-09-23 → 2023-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10242898

## Citation

> US National Institutes of Health, RePORTER application 10242898, Computational Methods for Sequence Alignment, Genotyping, and Diploid Genome Assembly (5R01GM135341-03). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10242898. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
