Detection and genotyping complex human genetic variation using single-molecule sequencing

NIH RePORTER · NIH · R01 · $412,500

Abstract

Project summary: Although single-molecule sequencing (SMS) technologies have advanced in recent years to enable routine sequencing and assembly of human genomes, new software is required to realize the potential of SMS in human genetics. The long-term goal is to help improve our understanding of complex variation in human diversity and its role in disease. To achieve this, we will develop methods to (1) detect variation in SMS reads, (2) assemble duplicated sequences missing from SMS de novo assemblies, and (3) genotype complex variation in large high-throughput sequencing (HTS) datasets using lightweight data structures. Several years of algorithm development for SMS data have produced a software ecosystem for detecting variation in SMS genomes. Continued development is needed because sensitivity and specificity are not yet sufficient for disease studies, important classes of variation are not resolved by current assembly approaches, and the knowledge gained from sequencing SMS genomes must be used to improve what can be discovered in large disease studies that rely heavily on short-read data, such as those conducted under TOPMed. The algorithmic innovations we will provide for SMS data are an alignment algorithm that explicitly optimizes over rearranged sequences and an assembly approach that exploits minor differences between duplication copies to resolve genome function. Software will be supported through Bioconda installation and distributed test cases. Once a variant is discovered by SMS, it may be more easily genotyped in short-read data. We will develop methods to generate databases of SMS variation that may be queried with short-read data. To aid in the development of assembly algorithms for duplicated sequences, we will generate a public resource of SMS data for individuals with known copy number polymorphisms.
This work is significant because it enables SMS genomes to be used in disease studies, both by uncovering previously hidden variation and by increasing the amount of variation found in large short-read datasets.

Key facts

NIH application ID: 10186109
Project number: 1R01HG011649-01
Recipient: UNIVERSITY OF SOUTHERN CALIFORNIA
Principal Investigator: Mark Chaisson
Activity code: R01
Funding institute: NIH
Fiscal year: 2021
Award amount: $412,500
Award type: 1
Project period: 2021-07-15 → 2026-04-30