# Representing structural haplotypes and complex genetic variation in pan-genome graphs

> **NIH U01** · UNIVERSITY OF SOUTHERN CALIFORNIA · 2020 · $389,008

## Abstract

A pan-genome graph (PGG) reference must faithfully reflect structural haplotypes that differ in copy number,
order, and orientation, which are currently poorly represented in a linear reference sequence. This effort
focuses on the most copy variable and complex regions, including segmental duplications (SDs), inversions,
short tandem repeats/variable number tandem repeats (copy-number-variable repeats, CNVRs) and
combinations thereof that are frequently excluded or collapsed in reference genomes. The overarching goal of
this project is to develop the tool infrastructure enabling the construction of whole-chromosome reference
haplotypes that include all of these difficult classes of sequence. There are four specific aims. First, we will
develop methods to construct PGGs from haplotype-phased de novo assemblies, ensuring the graph reflects
both copy number variation and repeat structure, including CNVRs and SDs. Second, we will develop software
that will expand SD assembly methods to facilitate the curation of SD loci in PGGs. We will use SD assembly
to detect variants specific to individual copies of a duplication, called paralog-specific variants (PSVs), and
provide software to reconstruct local haplotype paths through the PGG that describe the different copies. Third,
we will design novel methods to exploit single-cell template strand DNA sequencing data (Strand-seq) mapped
to PGGs to thread chromosome-length "structural haplotypes" through the graph. Our software will thereby
allow the physical resolution of haplotypes comprising the full spectrum of structural variation,
including inversions and inverted duplications. By virtue of the PSVs, the structural haplotypes will also embed
sequence-resolved SDs. Fourth, we will develop a scalable open-source software framework to systematically
assess how the inclusion of single-nucleotide variants, short indels, and structural variant classes in the PGG
affects variant detection with short-read data. This will enable the optimization of the complexity encoded in the
PGG for short-read variant detection. It will additionally provide a comprehensive view of polymorphic and
fixed k-mers in human populations. We will develop tools to detect allele-specific k-mers and demonstrate how
they enable the rapid genotyping of variants in the PGG based on the k-mer composition of a short-read dataset.
Once the framework for enhanced genome representation is established, we will focus on improving efficiency,
scalability, and computational ease to cater to the needs of a broad range of users in genetics and genome
science. This proposal will ensure that the most complex regions of the human genome are encoded into the
PGG and that underlying genetic variation is ultimately assessed for association with disease.
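
The k-mer genotyping idea in the fourth aim can be illustrated with a minimal Python sketch. It is not the project's software: the function names (`kmers`, `allele_specific_kmers`, `genotype`), the toy sequences, and the `min_count` threshold are all hypothetical. It also assumes error-free reads on the forward strand and assumes that allele-specific k-mers occur nowhere else in the genome.

```python
def kmers(seq, k):
    """Return every k-mer of seq, in order, including repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def allele_specific_kmers(ref_allele, alt_allele, k):
    """Return (ref_only, alt_only): k-mers found in exactly one allele."""
    ref, alt = set(kmers(ref_allele, k)), set(kmers(alt_allele, k))
    return ref - alt, alt - ref


def genotype(reads, ref_allele, alt_allele, k=5, min_count=1):
    """Call a diploid genotype from allele-specific k-mer support.

    Hypothetical rule: an allele is present if its specific k-mers are
    seen at least min_count times across all reads.
    """
    ref_only, alt_only = allele_specific_kmers(ref_allele, alt_allele, k)
    ref_support = alt_support = 0
    for read in reads:
        for km in kmers(read, k):
            if km in ref_only:
                ref_support += 1
            elif km in alt_only:
                alt_support += 1
    has_ref = ref_support >= min_count
    has_alt = alt_support >= min_count
    if has_ref and has_alt:
        return "0/1"
    if has_ref:
        return "0/0"
    if has_alt:
        return "1/1"
    return "./."  # no informative k-mers observed


# Toy alleles differing by a single A>T substitution (illustrative only).
REF = "ACGTACGGA"
ALT = "ACGTTCGGA"
het_call = genotype(["ACGTACGG", "CGTTCGGA"], REF, ALT)
hom_ref_call = genotype(["ACGTACGG"], REF, ALT)
```

A practical tool would also need to canonicalize reverse complements, model sequencing error and coverage, and exclude k-mers that are not unique to the variant site.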

## Key facts

- **NIH application ID:** 9906038
- **Project number:** 1U01HG010973-01
- **Recipient organization:** UNIVERSITY OF SOUTHERN CALIFORNIA
- **Principal Investigator:** Mark Chaisson
- **Activity code:** U01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $389,008
- **Award type:** 1
- **Project period:** 2020-02-10 → 2023-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9906038

## Citation

> US National Institutes of Health, RePORTER application 9906038, Representing structural haplotypes and complex genetic variation in pan-genome graphs (1U01HG010973-01). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/9906038. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
