# Systematic Identification of Core Regulatory Circuitry from ENCODE Data

> **NIH U01** · JOHNS HOPKINS UNIVERSITY · 2021 · $572,256

## Abstract

While much progress has been made in generating high-quality chromatin state and accessibility data from the
ENCODE and Roadmap consortia, accurately identifying cell-type specific enhancers from these data remains a
significant challenge. We have recently developed a computational approach (gkmSVM) to predict regulatory
elements from DNA sequence, and we have shown that when gkmSVM is trained on DHS data from each
of the human and mouse ENCODE and Roadmap cells and tissues, it can predict both cell-specific enhancer
activity and the impact of regulatory variants (deltaSVM) with greater precision than alternative approaches.
The gkmSVM model encapsulates a set of cell-type specific weights describing the regulatory binding
site vocabulary controlling chromatin accessibility in each cell type. A striking observation is that the significant
gkmSVM weights are generally identifiable with a small (~20) set of TF binding sites which vary by cell-type,
consistent with the hypothesis that cell-type specific expression programs are controlled by a small set
of core factors tightly coupled in mutually interacting regulatory circuits. Perturbations of these core regulators
enable transitions between stable differentiated cell-type states of this genetic circuit. Here, we will use
gkmSVM to systematically identify the core regulatory circuitry in all existing ENCODE and Roadmap human
and mouse cell lines and tissues, and produce DNA sequence based genomic regulatory maps and fine-scale
predictions of core regulator binding sites within predicted regulatory regions. We will generate binding
site models for core regulators in each cell type and assess the accuracy of our predictions through direct
experimental validation. The value of this map critically depends on its accuracy, so we demonstrate that
gkmSVM predictions consistently outperform alternative methods in massively parallel enhancer reporter and
luciferase validation assays, in blind community assessments of regulatory element predictions (CAGI), and in
predicting validated causal disease-associated variants. In contrast, we show that methods using PWM
descriptions of TF binding sites are significantly less accurate. We will produce base-pair resolution predictions
of the cell-specific TF binding sites (TFBS) within broader regulatory regions detected by multiple ENCODE
epigenomic Mapping datasets, and we will test these TFBS predictions in collaboration with Functional
Characterization Centers (FCC). Our regulatory maps will help design and inform focused experiments
probing regulatory mechanisms, and aid in the interpretation of disease-associated non-coding variants.
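
The abstract describes scoring regulatory variants (deltaSVM) from cell-type specific sequence weights. The sketch below illustrates that general idea under simplifying assumptions: it uses plain contiguous k-mers rather than gapped k-mers, and the weight table `TOY_WEIGHTS`, the helper names, and the example sequence are hypothetical and invented for illustration. It is not the project's gkmSVM implementation.

```python
from typing import Dict, Iterator

# Hypothetical per-k-mer weights standing in for a trained cell-type model.
# Real models would carry weights for many k-mers; these values are made up.
TOY_WEIGHTS: Dict[str, float] = {"GATA": 2.0, "ATAA": 1.0, "GCTA": -0.5}


def kmers_overlapping(seq: str, pos: int, k: int) -> Iterator[str]:
    """Yield every length-k substring of `seq` that covers index `pos`."""
    first_start = max(0, pos - k + 1)
    last_start = min(pos, len(seq) - k)
    for start in range(first_start, last_start + 1):
        yield seq[start:start + k]


def delta_score(ref_seq: str, pos: int, alt_base: str,
                weights: Dict[str, float], k: int) -> float:
    """Sum of k-mer weights for the alternate allele minus the reference allele.

    Only k-mers overlapping the variant position can differ between alleles,
    so only those are scored. K-mers missing from `weights` count as 0.
    """
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_total = sum(weights.get(km, 0.0)
                    for km in kmers_overlapping(ref_seq, pos, k))
    alt_total = sum(weights.get(km, 0.0)
                    for km in kmers_overlapping(alt_seq, pos, k))
    return alt_total - ref_total


# A to C substitution at index 3 disrupts the positively weighted GATA k-mer.
example = delta_score("CCGATAAGG", 3, "C", TOY_WEIGHTS, k=4)
print(example)
```

In this toy case the variant removes two positively weighted k-mers and creates one negatively weighted k-mer, so the score is negative, which would be read as a predicted loss of regulatory activity in that cell type.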

## Key facts

- **NIH application ID:** 10238262
- **Project number:** 3U01HG009380-04S1
- **Recipient organization:** JOHNS HOPKINS UNIVERSITY
- **Principal Investigator:** Michael A Beer
- **Activity code:** U01
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $572,256
- **Award type:** 3
- **Project period:** 2017-02-01 → 2023-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10238262

## Citation

> US National Institutes of Health, RePORTER application 10238262, Systematic Identification of Core Regulatory Circuitry from ENCODE Data (3U01HG009380-04S1). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10238262. Licensed CC0.

