Novel methods for large-scale genomic interval comparison

NIH RePORTER · NIH · R01 · $376,293

Abstract

Epigenome data are driving discovery in biomedical analysis of genetic variation and gene regulation. Epigenome data produced by experimental protocols such as ATAC-seq or ChIP-seq are often summarized into sets of genomic intervals, each defined by a chromosome plus start and end coordinates. Databases now provide hundreds of thousands of such region sets, each containing potentially hundreds of thousands of individual regions. These data hold tremendous promise for understanding gene regulation and disease because many health outcomes are affected by genetic variation or epigenetic perturbation in regulatory DNA. Many different tools and methods have been developed to assess such sets of genomic intervals. These approaches are used for a broad array of biomedical research, such as annotating genetic variation associated with disease traits. To support region-based analyses, we and others have developed novel data structures and algorithms to compare the similarity of region sets and to compute overlaps between interval sets, enabling interval comparisons on millions of regions. But as genomic interval set data sources grow in size and scope, we require both faster algorithms and novel methods to compare this important data type. As the amount of available data increases, it is becoming intractable to compute exact overlaps. Furthermore, the fastest algorithms analyze only pure intervals, not signal values, which could be used to compare interval sets more accurately. Existing approaches have also made little progress in defining canonical interval sets to simplify analysis even further. Here, we address these limitations in several ways. First, we develop novel, more scalable algorithms using approximate computations and define the idea of interval set universes to consolidate analysis. Second, we develop an innovative approach to analyzing region sets that goes beyond simply counting overlaps, instead relying on cutting-edge machine learning methods to learn and measure similarity more accurately. We propose a novel set-theoretic approach to comparing intervals that builds on techniques from natural language processing. Together, these yield a first-pass filter that can be reasonably computed on data sets containing billions to trillions of genomic intervals, followed by a more accurate analysis to identify more subtle relationships among region sets. These advances will improve both the efficiency and accuracy of existing biomedical research approaches, and open the door to new ways of exploring the vast and growing corpus of genome interval data.
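As a rough illustration of the exact overlap computation that the abstract describes as the current baseline, the sketch below represents a region set as a list of (chromosome, start, end) tuples and computes a base-pair Jaccard similarity between two sets. It is not the project's method: the function names, the half-open coordinate convention, and the example regions are all assumptions made for illustration.

```python
from collections import defaultdict

def merge(regions):
    """Group regions by chromosome and merge overlapping or adjacent ones.

    Assumes half-open coordinates (start inclusive, end exclusive).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)  # extend the current run
            else:
                out.append([start, end])
        merged[chrom] = out
    return merged

def covered_bp(merged):
    """Total number of bases covered by a merged region set."""
    return sum(end - start for ivs in merged.values() for start, end in ivs)

def intersection_bp(a, b):
    """Bases shared by two merged region sets, via a linear sweep."""
    total = 0
    for chrom in a.keys() & b.keys():
        x, y = a[chrom], b[chrom]
        i = j = 0
        while i < len(x) and j < len(y):
            lo = max(x[i][0], y[j][0])
            hi = min(x[i][1], y[j][1])
            if lo < hi:
                total += hi - lo
            # Advance whichever interval ends first.
            if x[i][1] < y[j][1]:
                i += 1
            else:
                j += 1
    return total

def jaccard(set_a, set_b):
    """Base-pair Jaccard similarity: shared bases / bases covered by either set."""
    a, b = merge(set_a), merge(set_b)
    inter = intersection_bp(a, b)
    union = covered_bp(a) + covered_bp(b) - inter
    return inter / union if union else 0.0

# Hypothetical example region sets (e.g. peak calls from two experiments).
peaks_a = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
peaks_b = [("chr1", 250, 400), ("chr2", 60, 70)]
similarity = jaccard(peaks_a, peaks_b)
```

Each pairwise comparison here costs a sort plus a linear sweep, and comparing every pair in a collection of hundreds of thousands of region sets multiplies that cost quadratically, which is the scaling pressure the abstract cites as motivation for approximate methods.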

Key facts

NIH application ID: 10853036
Project number: 5R01HG012558-03
Recipient: UNIVERSITY OF VIRGINIA
Principal Investigator: Nathan Sheffield
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $376,293
Award type: 5
Project period: 2022-08-10 → 2026-05-31