# Representing structural haplotypes and complex genetic variation in pan-genome graphs

> **NIH U01** · UNIVERSITY OF SOUTHERN CALIFORNIA · 2023 · $267,573

## Abstract

The initial phase of sequencing the human pangenome has resulted in the assembly of a
diverse collection of genomes. In parallel, an ecosystem of sophisticated computational
methods was developed to organize the pangenome into a graphical data structure that efficiently
reflects the diversity of a global population, along with the sequence analysis methods that
geneticists need to use the pangenome to improve how their studies are performed relative to a
single reference genome. The pangenome revealed important factors about human genetic
variation. In particular, there is a considerable amount of sequence diversity and novel
sequences in the pangenome that arise from repetitive DNA. Because the initial methods
developed to analyze variation in the pangenome were created for relatively simple variation
outside of repetitive DNA, it is necessary to develop novel methods to discover, genotype,
organize, and validate variation in repetitive regions of the genome. The scope of this analysis
ranges from short repeated DNA sequences that are hundreds of bases long to entire regions that
encompass genes. We will specifically develop methods to discover rare variation in
variable-number tandem repeat sequences, and perform paralog-specific discovery of
copy-number variation of genes. These methods will be developed to analyze short-read
sequencing data so that large-scale datasets such as those generated by TOPMed can take
advantage of these methods to improve variant discovery in their cohorts. We will additionally
develop methods to improve the representation of repetitive or rearranged sequences in the
graphical representation of the pangenome. This will be accomplished by modeling the
evolutionary relationships of repetitive sequences while building the graph, and validating
assembly organization using public datasets from the single-cell sequencing technique,
Strand-Seq. All of our development will be performed collaboratively with other members of the
Human Pangenome Reference Consortium. We will share methods for variant discovery with
other researchers who are studying large cohorts. Finally, any improvements in the pangenome
graph will be released in coordination with production and other groups so that there is a
standardized pangenome graph on which other researchers in the public can base their research.

## Key facts

- **NIH application ID:** 10832934
- **Project number:** 3U01HG010973-03S1
- **Recipient organization:** UNIVERSITY OF SOUTHERN CALIFORNIA
- **Principal Investigator:** Mark Chaisson
- **Activity code:** U01
- **Funding institute:** NIH
- **Fiscal year:** 2023
- **Award amount:** $267,573
- **Award type:** 3
- **Project period:** 2023-02-01 → 2024-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10832934

## Citation

> US National Institutes of Health, RePORTER application 10832934, Representing structural haplotypes and complex genetic variation in pan-genome graphs (3U01HG010973-03S1). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10832934. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*