Computational methods for variant calling and haplotyping using long-read sequencing technologies

NIH RePORTER · NIH · R01 · $380,962

Abstract

In this project, we propose to develop computational methods and tools for whole-genome haplotyping and small variant calling using long-read sequencing technologies, such as Pacific Biosciences and Oxford Nanopore, and linked-read technologies. Haplotype information is crucial for interpretation of genetic variation in individual genomes, disease mapping, clinical genomics, and several other analyses of human genetic variation. The lack of phase or haplotype information in human genomes sequenced using short reads is a major barrier in identifying disease associations with compound heterozygous mutations. More than 600 genes overlap segmental duplications with high sequence identity, and variants in more than 100 such genes have been associated with rare Mendelian disorders and complex diseases, including cancer. The inability to detect variants with high accuracy in duplicated regions of the genome using short-read sequencing technologies reduces the ability to identify disease-causing mutations in medical genetics studies. In Aim 1, we will develop a general computational method for long-read-based diploid genotyping that will enable accurate haplotyping of single nucleotide variants and short indels using long reads and linked reads, as well as accurate small variant calling using single-molecule sequencing (SMS) technologies. In Aim 2, we will develop computational methods for sensitive mapping of SMS reads and accurate variant calling in repetitive regions of the human genome that are currently excluded from benchmark small variant call sets for reference human genomes. Finally, in Aim 3, we will leverage the methods from Aims 1 and 2 to perform variant calling on multiple genomes sequenced using SMS technologies to catalog variant PSVs, and use this catalog to improve the read mapping and variant calling accuracy of short-read sequencing in repetitive regions of the genome.
We will implement the methods in robust and computationally efficient software tools and benchmark their accuracy using publicly available long-read sequence datasets for multiple human genomes of diverse ancestries.

Key facts

NIH application ID: 10657420
Project number: 5R01HG010759-04
Recipient: UNIVERSITY OF CALIFORNIA, SAN DIEGO
Principal Investigator: Vikas Bansal
Activity code: R01
Funding institute: NIH
Fiscal year: 2023
Award amount: $380,962
Award type: 5
Project period: 2020-09-01 → 2025-06-30