The construction and utility of reference pan-genome graphs

NIH RePORTER · NIH · U01 · $800,000

Abstract

The current human reference genome, GRCh38, plays a central role in human medical and population genetics. It primarily models a single individual and is missing hundreds of thousands of large structural variations segregating in the population. This underrepresentation of genetic diversity leads to various artifacts in data analysis and significantly hampers our understanding of the functional and medical relevance of these large variations, which may collectively have a pervasive impact. To address this issue, we will extend our previous work on sequence graphs and alignment algorithms and construct a pan-genome reference graph from hundreds of long-read human assemblies, which together represent genetic diversity more completely. Specifically, we will (1) design a reference graph model with a stable coordinate system compatible with GRCh38 and develop toolkits and libraries to interact with this model; (2) develop minimizer-based sequence-to-graph alignment algorithms for short and long sequences; (3) incrementally construct a reference graph by mapping assemblies to an existing graph and updating the graph; and (4) develop a graph-based genotyping algorithm and apply it to short-read-based projects to call structural variations missed by current pipelines. Upon completion, the proposed project could replace current practices based on a linear genome and will enable the profiling and study of complex human variations missed in most current research.
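Aim (2) relies on minimizers, which the abstract names but does not define. As a rough illustration only, and not the project's implementation, the Python sketch below picks the smallest k-mer in every window of w consecutive k-mers and then pairs up positions where a query and a target share a minimizer. The function names `minimizers` and `shared_anchors`, the parameter values, and the plain lexicographic ordering are assumptions made for this example; practical aligners typically hash k-mers and consider both strands, and a graph aligner would index node sequences rather than one linear target.

```python
from collections import defaultdict


def minimizers(seq, k=5, w=4):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Simplifying assumption: k-mers are ordered lexicographically and only
    the forward strand is used.
    """
    n_kmers = len(seq) - k + 1
    if n_kmers < w:
        return []  # sequence too short to form a single full window
    picked = set()
    for start in range(n_kmers - w + 1):
        # Candidates are (k-mer, position); min() prefers the smallest
        # k-mer and, on ties, the leftmost position.
        window = [(seq[i:i + k], i) for i in range(start, start + w)]
        kmer, pos = min(window)
        picked.add((pos, kmer))
    return sorted(picked)


def shared_anchors(query, target, k=5, w=4):
    """Return (query_pos, target_pos) pairs with identical minimizer k-mers."""
    index = defaultdict(list)
    for pos, kmer in minimizers(target, k, w):
        index[kmer].append(pos)
    anchors = []
    for qpos, kmer in minimizers(query, k, w):
        for tpos in index.get(kmer, []):
            anchors.append((qpos, tpos))
    return sorted(anchors)


if __name__ == "__main__":
    print(minimizers("ACGTACGT", k=3, w=2))
    print(shared_anchors("ACGTACGT", "TTACGTACGTTT", k=3, w=2))
```

Anchors of this kind are usually chained into candidate alignments before base-level alignment is attempted; that step is omitted here.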

Key facts

NIH application ID: 9904877
Project number: 1U01HG010961-01
Recipient: DANA-FARBER CANCER INST
Principal Investigator: Heng Li
Activity code: U01
Funding institute: NIH
Fiscal year: 2020
Award amount: $800,000
Award type: 1
Project period: 2020-03-01 → 2023-02-28