Long-read sequence and assembly of segmental duplications

NIH RePORTER · NIH · R01 · $593,478

Abstract

The completion of the first genome has revealed a complex pattern of recent duplications that contributes disproportionately to human genetic variation. Our genome is particularly enriched for interspersed segmental duplications, which harbor rapidly evolving genes and predispose our species to copy number variation and recurrent rearrangements associated with disease. Their length, sequence identity, and structural variation, however, still complicate genome assembly and represent a major impediment to the generation of telomere-to-telomere assemblies of human genomes. The long-term objective of this research program has been to develop computational and experimental methods to understand the organization, genetic diversity, and disease impact of segmental duplications. The goal of this competing renewal is to apply long-read sequencing technologies with graph-based approaches to resolve the most complex regions in hundreds of human and ape genomes. There are four aims: (1) determine the sequence structure of these recent duplications in humans by coupling orthogonal long-read sequencing technologies to generate complete, high-quality reference sequences; (2) understand the genetic diversity of this structure by focusing on the most dynamic and problematic gene-rich regions in more than 350 human genomes and a diversity of non-human ape genomes; (3) generate matched DNA and RNA long-read data to explore the transcriptional potential and epigenetic features of segmental duplications in the human genome; and (4) develop a graph-based genotyper specifically optimized to assay copy number polymorphic duplicated loci in short-read whole-genome sequence data, allowing their diversity to be explored more systematically. This work will provide fundamental new insights into the structural complexity of human genomes and the mutational processes that have shaped them. It will identify new copy number polymorphic genes and their distribution among human populations, and it will provide our first assessment of how such genomic regions are regulated and lead to the emergence of new genes. This research has the additional benefit that it will add new sequence to reference genomes, facilitate more routine telomere-to-telomere assembly, and provide us with the ability to systematically explore genetic variation in regions frequently overlooked in disease-association studies.

Key facts

NIH application ID
10841602
Project number
5R01HG002385-23
Recipient
UNIVERSITY OF WASHINGTON
Principal Investigator
Evan Eichler
Activity code
R01
Funding institute
NIH
Fiscal year
2024
Award amount
$593,478
Award type
5
Project period
2001-09-21 → 2027-04-30