# Novel methods for large-scale genomic interval comparison

> **NIH R01** · UNIVERSITY OF VIRGINIA · 2024 · $376,293

## Abstract

Epigenome data are driving discovery in biomedical analysis of genetic variation and gene regulation. Epigenome data produced by experimental protocols such as ATAC-seq or ChIP-seq are often summarized into sets of genomic intervals defined by a chromosome plus start and end coordinates. Databases now provide hundreds of thousands of such region sets, each containing potentially hundreds of thousands of individual regions. These data hold tremendous promise for understanding gene regulation and disease because many health outcomes are affected by genetic variation or epigenetic perturbation in regulatory DNA.

Many different tools and methods have been developed to assess such sets of genomic intervals. These approaches are used for a broad array of biomedical research, such as annotating genetic variation associated with disease traits. Supporting region-based analyses, we and others have developed novel data structures and algorithms to compare the similarity of region sets and to compute overlaps between interval sets, enabling interval comparisons on millions of regions. But as the genomic interval set data sources grow in size and scope, we require both faster algorithms and novel methods to compare this important data type.

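
To make the idea of computing overlaps between interval sets concrete, here is a minimal Python sketch. It is an illustration only, not the data structures or algorithms developed by the project. It assumes each region is a `(chrom, start, end)` tuple with half-open coordinates, and the function names `merge_by_chrom` and `count_overlaps` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def merge_by_chrom(regions):
    """Group regions by chromosome and merge them into sorted, disjoint intervals."""
    grouped = defaultdict(list)
    for chrom, start, end in regions:
        grouped[chrom].append((start, end))

    merged = {}
    for chrom, intervals in grouped.items():
        intervals.sort()
        starts, ends = [], []
        for start, end in intervals:
            if starts and start <= ends[-1]:
                # Overlapping or adjacent: extend the current merged interval.
                ends[-1] = max(ends[-1], end)
            else:
                starts.append(start)
                ends.append(end)
        merged[chrom] = (starts, ends)
    return merged


def count_overlaps(query, database):
    """Count query regions that overlap at least one database region."""
    merged = merge_by_chrom(database)
    hits = 0
    for chrom, start, end in query:
        if chrom not in merged:
            continue
        starts, ends = merged[chrom]
        # Last merged interval whose start lies strictly before the query end.
        i = bisect_left(starts, end) - 1
        if i >= 0 and ends[i] > start:
            hits += 1
    return hits


set_a = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 60)]
set_b = [("chr1", 150, 250), ("chr1", 400, 500), ("chr2", 10, 20)]
print(count_overlaps(set_a, set_b))
```

In this sketch, merging the database regions first lets each query be answered with a single binary search. The exact computation still touches every query region, which is the cost that grows with the size of the data.
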
As the amount of available data increases, it is becoming intractable to compute exact overlaps. Furthermore, the fastest algorithms analyze only pure intervals, not signal values, which could be used to compare interval sets more accurately. Existing approaches have made little progress in the area of defining canonical interval sets to simplify analysis even further.

Here, we address these limitations in several ways. First, we develop novel, more scalable algorithms using approximate computations and define the idea of interval set universes to consolidate analysis. Second, we develop an innovative approach to analyzing region sets that goes beyond simply counting overlaps, instead relying on cutting-edge machine learning methods to learn and measure similarity more accurately. We propose a novel set theoretic approach building on techniques from natural language processing to compare intervals. Together, we propose a first-pass filter that can be reasonably computed on data sets containing billions to trillions of genomic intervals, followed by a more accurate analysis to identify more subtle relationships among region sets. These advances will improve both the efficiency and accuracy of existing biomedical research approaches, and open the door to new ways of exploring the vast and growing corpus of genome interval data.
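
One possible reading of an interval set universe can be sketched as follows: a fixed list of reference regions acts as a shared vocabulary, each region set is reduced to the set of universe entries it overlaps, and region sets are then compared with ordinary set operations. This is an assumption made for illustration, since the abstract does not specify how universes are built or used. The names `tokenize` and `jaccard`, the toy universe, and the half-open coordinate convention are all hypothetical.

```python
def tokenize(regions, universe):
    """Map a region set to the indices of the universe regions it overlaps."""
    tokens = set()
    for chrom, start, end in regions:
        for idx, (u_chrom, u_start, u_end) in enumerate(universe):
            # Half-open intervals overlap when each starts before the other ends.
            if chrom == u_chrom and start < u_end and u_start < end:
                tokens.add(idx)
    return tokens


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# Toy universe of three reference regions (illustrative values only).
universe = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 100)]
set_a = [("chr1", 10, 20), ("chr1", 150, 160)]
set_b = [("chr1", 90, 110), ("chr2", 5, 6)]

tokens_a = tokenize(set_a, universe)
tokens_b = tokenize(set_b, universe)
print(sorted(tokens_a), sorted(tokens_b), jaccard(tokens_a, tokens_b))
```

The linear scan over the universe keeps the sketch short; a real implementation would need an indexed lookup. Once region sets are expressed as tokens from a common vocabulary, they resemble documents made of words, which is the kind of representation that natural language processing techniques operate on.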

## Key facts

- **NIH application ID:** 10853036
- **Project number:** 5R01HG012558-03
- **Recipient organization:** UNIVERSITY OF VIRGINIA
- **Principal Investigator:** Nathan Sheffield
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $376,293
- **Award type:** 5
- **Project period:** 2022-08-10 → 2026-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10853036

## Citation

> US National Institutes of Health, RePORTER application 10853036, Novel methods for large-scale genomic interval comparison (5R01HG012558-03). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10853036. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
