# Using Syntenic Gapped-kmer Composition to Detect Conserved Enhancers where Sequence Alignment Fails

> **NIH R56** · JOHNS HOPKINS UNIVERSITY · 2021 · $388,975

## Abstract

While thousands of genomic loci have been associated with common human disease, translating
these disease associations to clinical impact is often limited by lack of appropriate human cell lines
and our inability to unambiguously identify orthologous regulatory regions for studies in model
organisms. Sequence alignment, the most frequently used method to detect evolutionary
conservation of enhancers and promoters, accurately identifies genes and promoters, but often fails
to detect many functionally conserved distal enhancers, in spite of the fact that orthologous TFs
generally have conserved binding activity. We have developed a machine learning approach to
identify the set of transcription factor binding sites (TFBS) active in cell-type specific enhancers from
epigenomic data using gapped k-mer features that encompass all possible TFBS (gkm-SVM), and
this classifier can distinguish active enhancers from non-active regions. This DNA sequence-based
model can accurately predict enhancer activity in massively parallel reporter assays and the impact of
variation in regulatory elements associated with human disease (deltaSVM). The central aim of this
proposal is to develop a new computational method using the syntenic gapped-kmer composition to
detect functionally conserved regulatory elements missed by conventional sequence alignment
methods. Previous studies of enhancer evolution across mammals using sequence alignment have
reported that promoters are more conserved than enhancers, and that enhancers are evolving
rapidly. In contrast, using gapped k-mers, we find that cell-type specific enhancers and promoters in
matched ENCODE/Roadmap tissues are equally functionally conserved, and that gapped k-mers can
identify conserved enhancers that are undetectable by sequence alignment. We hypothesize that the
improvements relative to sequence alignment methods arise because the gapped-kmer feature space
is able to detect similarity between rearrangements and variations of TF binding sites which may vary
at gapped positions but which retain similar binding affinities. We will develop a method to detect
conserved regulatory regions using the gkm-SVM kernel as a metric of sequence conservation and
optimize this method by comparing to genome-wide functional data. We will then develop algorithms
to detect long-range syntenic intervals of similar gapped k-mer composition and generate genome-wide
maps of evolutionary conservation. We will validate the predictions with CRISPRi in human and
mouse stem cells differentiated to endoderm.
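
To make the idea of gapped k-mer composition as a conservation metric concrete, here is a minimal Python sketch. It is an illustration only, not the project's gkm-SVM implementation: the function names, the window length `l=6`, and the number of informative positions `k=4` are assumptions chosen for the example, and it omits reverse-complement handling and the efficient data structures a genome-scale tool would need. Each length-`l` window is expanded into every pattern that keeps `k` positions and masks the rest, and two sequences are compared by the cosine similarity of their pattern counts.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def gapped_kmer_counts(seq, l=6, k=4):
    """Count gapped k-mers: for each length-l window, keep k positions
    and mask the other l - k positions with '-'. (l and k are illustrative.)"""
    seq = seq.upper()
    masks = [frozenset(m) for m in combinations(range(l), k)]
    counts = Counter()
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in "ACGT" for base in window):
            continue
        for keep in masks:
            gapped = "".join(window[j] if j in keep else "-" for j in range(l))
            counts[gapped] += 1
    return counts


def cosine_similarity(a, b):
    """Normalized inner product of two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gkm_similarity(seq1, seq2, l=6, k=4):
    """Hypothetical helper: gapped k-mer composition similarity in [0, 1]."""
    return cosine_similarity(gapped_kmer_counts(seq1, l, k),
                             gapped_kmer_counts(seq2, l, k))


if __name__ == "__main__":
    # One mismatch: the contiguous 6-mers differ, but the gapped patterns
    # that mask the mismatched position are still shared.
    print(gkm_similarity("ACGTAC", "ACGTTC"))
```

In this toy case the two 6-mers share no contiguous 6-mer, yet 5 of their 15 gapped patterns agree (those that mask the mismatched position), so the similarity stays above zero. That tolerance to variation at gapped positions is the property the abstract hypothesizes explains the improvement over alignment.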

## Key facts

- **NIH application ID:** 10480233
- **Project number:** 1R56HG012110-01
- **Recipient organization:** JOHNS HOPKINS UNIVERSITY
- **Principal Investigator:** Michael A Beer
- **Activity code:** R56
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $388,975
- **Award type:** 1
- **Project period:** 2021-09-24 → 2024-04-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10480233

## Citation

> US National Institutes of Health, RePORTER application 10480233, Using Syntenic Gapped-kmer Composition to Detect Conserved Enhancers where Sequence Alignment Fails (1R56HG012110-01). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10480233. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
