Tooling for accurately studying the epigenome along the human pangenome reference

NIH RePORTER · NIH · U01 · $1,418,806

Abstract

This proposal will provide the foundational tooling for understanding the function of the pan-genome reference through the accurate annotation of regulatory elements within the pan-genome. As the genetic component of the pan-genome reference comes into focus, the next challenge is understanding the functional relevance of genetic variants within this reference. However, resolving this challenge requires tooling that enables users to: (1) get accurate epigenetic data into a pan-genome reference; and (2) use epigenetic data once it is in a pan-genome reference. This proposal leverages our team’s unique expertise in long-read epigenetics, short-read epigenetics, pan-genome assembly, and genomic software development to develop transformative tooling for threading accurate epigenetic information into a pan-genome graph, as well as extracting epigenetic information from a pan-genome in a manner that is compatible with existing epigenetic and genetic analysis tools. Our tooling is grounded in first assembling accurate epigenetic annotations at the level of haploid linear contigs, which are then threaded into a pan-genome reference. This approach significantly improves the accuracy with which both long- and short-read epigenetic features are mapped into a pan-genome, enables our tooling to readily adapt to new pan-genomes, and enables user-generated epigenetic data to be incorporated into a pan-genome reference without having to remake the pan-genome reference itself. Importantly, we are designing this tooling to work for diverse types of epigenetic data acquired across sequencing platforms. In addition, this tooling will be available through AnVIL, Conda, and other platforms, enabling users to readily adopt it into their own research pipelines.
Specifically, in Aim 1 we will develop tooling that uses a semi-supervised machine learning approach to accurately classify long-read epigenetic data collected using diverse experimental methods and sequencing platforms. In Aim 2, we will develop tooling that accurately aggregates long-read epigenetic data onto haploid linear contigs, and then threads either long-read or short-read epigenetic data into a pan-genome reference. In Aim 3, we will create fundamental operation tools for processing epigenetic data within a pan-genome to identify epigenetic and genetic features at specific points of interest within a pan-genome in a sample-, path-, and read-aware manner. Finally, we will apply our tooling to existing long-read and short-read epigenetic datasets to identify genetic variants within the pan-genome reference associated with haplotype-, paralog-, and sample-specific epigenetic features.

Key facts

NIH application ID: 10976065
Project number: 1U01HG013744-01
Recipient: UNIVERSITY OF WASHINGTON
Principal Investigator: Andrew Ben Stergachis
Activity code: U01
Funding institute: NIH
Fiscal year: 2024
Award amount: $1,418,806
Award type: 1
Project period: 2024-09-19 → 2027-08-31