Representing structural haplotypes and complex genetic variation in pan-genome graphs

NIH RePORTER · NIH · U01 · $267,573

Abstract

Project Summary: The initial phase of sequencing the human pangenome has resulted in the assembly of a diverse collection of genomes. In parallel, an ecosystem of sophisticated computational methods was developed to organize the pangenome into a graph data structure that efficiently reflects the diversity of a global population, along with the sequence analysis methods geneticists need to use the pangenome in place of a single reference genome and so improve how their studies are performed. The pangenome revealed important features of human genetic variation. In particular, a considerable amount of the sequence diversity and novel sequence in the pangenome arises from repetitive DNA. Because the initial methods for analyzing variation in the pangenome were designed for relatively simple variation outside of repetitive DNA, it is necessary to develop novel methods to discover, genotype, organize, and validate variation in repetitive regions of the genome. The scope of this analysis ranges from short repeated DNA sequences that are hundreds of bases long to entire regions that encompass genes. We will specifically develop methods to discover rare variation in variable-number tandem repeat sequences and to perform paralog-specific discovery of copy-number variation of genes. These methods will be developed to analyze short-read sequencing data so that large-scale datasets, such as those generated by TOPMed, can use them to improve variant discovery in their cohorts. We will additionally develop methods to improve how repetitive or rearranged sequences are represented in the pangenome graph. This will be accomplished by modeling the evolutionary relationships of repetitive sequences while building the graph, and by validating assembly organization using public datasets from the single-cell sequencing technique Strand-seq.
All of our development will be performed collaboratively with other members of the Human Pangenome Reference Consortium. We will share methods for variant discovery with other researchers who are studying large cohorts. Finally, any improvements to the pangenome graph will be released in coordination with production and other groups, so that researchers in the wider community have a standardized pangenome graph on which to base their research.

Key facts

NIH application ID: 10832934
Project number: 3U01HG010973-03S1
Recipient: UNIVERSITY OF SOUTHERN CALIFORNIA
Principal Investigator: Mark Chaisson
Activity code: U01
Funding institute: NIH
Fiscal year: 2023
Award amount: $267,573
Award type: 3
Project period: 2023-02-01 → 2024-01-31