# An ensemble framework for regulatory variant prediction.

> **NIH R01** · BOSTON CHILDREN'S HOSPITAL · 2024 · $565,544

## Abstract

PROJECT SUMMARY
Transcriptional cis-regulatory elements (CREs), such as enhancers and promoters, play an essential role in all
biological processes by controlling the expression of their target genes. Sequence variants in these CREs can
perturb their target gene expression by altering the binding of transcription factors (TFs). It is now clear that,
for most human disorders, substantial risk is encoded within these noncoding regulatory variants. However,
systematic identification of regulatory variants and their causative transcriptional machinery for human
diseases remains challenging. Over the past decade, we have pioneered approaches to these problems and
have made significant progress in developing machine-learning-based methods to predict CREs (gkm-SVM)
and regulatory variants (deltaSVM) from DNA sequence. We recently demonstrated that these regulatory
variants predicted by deltaSVM significantly contribute to the heritability of human traits and diseases in a
tissue- and cell-specific way. Here, we will extend these methodologies to further improve the discovery of
regulatory variants in the human genome and explore their contribution to human diseases and traits. Toward
this end, we will employ a two-step training approach. We will first build multiple sequence-based models to
predict regulatory variants trained on a compendium of genomic data. We will then train ensemble models to
find optimal combinations of these models to predict experimentally identified regulatory variants that exhibit
allelic imbalance in chromatin accessibility. Uniquely, we will build this model in a cell-type resolved manner
using human kidney single-cell chromatin accessibility data. Next, we will systematically assess these models
using a broad range of human traits and diseases from well-powered genome-wide association studies (GWAS).
We will then computationally identify the target genes of these predicted regulatory variants and prioritize genes
based on their contribution to traits and diseases relevant to tissues and cells using co-localization analyses.
Lastly, we will experimentally validate these putative regulatory variants with massively parallel reporter
assays and their predicted target genes with multiple CRE deletion experiments using CRISPR-Cas9. As an
exemplar, we will focus on kidney traits and use kidney-relevant cell lines for these validation experiments. Our
framework will enable us to further improve regulatory variation discovery and ultimately help us better
understand how gene regulatory mechanisms are perturbed in human diseases and trait variation.
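
The two-step training approach described above can be illustrated with a minimal sketch. It is not the investigators' implementation. Step 1 scores each variant with several sequence-based models, here using a deltaSVM-style score (summed k-mer weights of the alternate allele window minus the reference allele window). Step 2 fits a small logistic-regression stacker, swapped in as a stand-in for the ensemble model, to predict a binary allelic-imbalance label. The k-mer weight tables, the toy variants, the use of the score magnitude as a feature, and every function name are assumptions made for illustration only.

```python
import math

def delta_score(ref_seq, alt_seq, kmer_weights, k):
    """deltaSVM-style score: summed k-mer weights of the alt window minus the ref window."""
    def window_score(seq):
        return sum(kmer_weights.get(seq[i:i + k], 0.0)
                   for i in range(len(seq) - k + 1))
    return window_score(alt_seq) - window_score(ref_seq)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def base_features(ref_seq, alt_seq, base_models, k):
    # Step 1 output: one |delta| per base model (using the magnitude is an assumption).
    return [abs(delta_score(ref_seq, alt_seq, w, k)) for w in base_models]

def fit_stacker(X, y, lr=0.3, epochs=3000):
    # Step 2: logistic regression over base-model scores, fit by batch gradient descent.
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(features, w, b):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, features)))

# Hypothetical 3-mer weight tables standing in for two trained base models.
K = 3
WEIGHTS_A = {"GAT": 2.0, "ATA": 1.0, "GCT": -1.0}
WEIGHTS_B = {"GAT": 1.5, "CTA": -0.5, "TTT": 0.2}
BASE_MODELS = [WEIGHTS_A, WEIGHTS_B]

# Toy variants: (reference window, alternate window, allelic-imbalance label).
VARIANTS = [
    ("AGATA", "AGCTA", 1),
    ("CCGCC", "CCACC", 0),
    ("TGATT", "TGTTT", 1),
    ("TTTTG", "TTCTG", 0),
]

X = [base_features(ref, alt, BASE_MODELS, K) for ref, alt, _ in VARIANTS]
y = [label for _, _, label in VARIANTS]
stack_w, stack_b = fit_stacker(X, y)
probs = [predict_proba(x, stack_w, stack_b) for x in X]
```

In this sketch the stacker learns how much to trust each base model's score. A cell-type resolved variant, as proposed in the abstract, would presumably fit one such stacker per cell type using that cell type's allelic-imbalance labels.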

## Key facts

- **NIH application ID:** 10839818
- **Project number:** 5R01HG012871-02
- **Recipient organization:** BOSTON CHILDREN'S HOSPITAL
- **Principal Investigator:** Dongwon Lee
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $565,544
- **Award type:** 5
- **Project period:** 2023-05-10 → 2028-02-29

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10839818

## Citation

> US National Institutes of Health, RePORTER application 10839818, An ensemble framework for regulatory variant prediction. (5R01HG012871-02). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10839818. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
