# Generating better-targeted training data for computational enhancer discovery in vector insects

> **NIH R03** · STATE UNIVERSITY OF NEW YORK AT BUFFALO · 2024 · $79,256

## Abstract

Vector insects are responsible for over 700,000 deaths annually and hundreds of billions of dollars in associated
economic impact. Although many vector insect species have had their genomes sequenced, the regulatory component
of these genomes is largely undefined. Characterizing regulatory sequences—in particular, “enhancers”—is critical
for understanding the organization of gene regulatory networks and how the genome informs phenotype.
Enhancers play important roles in mediating insecticide resistance, pathogen transmission, host recognition, mating
success, and other critical vector-relevant biological processes. Enhancers also serve as major components of
biotechnology tools requiring precisely targeted gene expression, adding further importance to their identification
and characterization. Given that enhancers are so important, but that few insect enhancers are known other than for
the fruit fly, Drosophila melanogaster, there is a pressing need for efficient methods for enhancer discovery that can
be applied to a wide range of vector species. We previously developed “SCRMshaw,” a computational method for
enhancer discovery, and used it to predict enhancers in over 36 insect species including vector mosquitoes.
SCRMshaw utilizes known enhancers as “training data” to guide its search for unknown enhancers with related
function. We have demonstrated that we can use Drosophila enhancers, of which many (~38,000) are
well-characterized, as training data to discover similar enhancers in other insect species. Unfortunately, despite this large
number of known Drosophila enhancers, training enhancers are not available for many significant cell types. This is
especially true for non-embryonic stages of the life cycle and cells of particular interest for vector biology, e.g.
those involved in insecticide resistance, pathogen transmission, and reproduction. This proposal addresses this
major shortcoming by developing a new way to generate SCRMshaw training data, using scATAC-seq data. This will
open up a broad potential source of training data for currently under-studied cell types, and enable prediction of
enhancers in species at greater evolutionary distances from Drosophila. It will also allow for significantly
improved estimation of true- and false-positive enhancer prediction rates. The proposed approach is rapid and
inexpensive, and it requires empirical data from only a single representative organism yet can be applied to dozens
to hundreds of loosely related organisms. It will allow a single set of high-quality scATAC-seq experiments in
Drosophila or a vector mosquito to be leveraged for enhancer prediction in the majority of vector insect species.
This work will therefore provide immediate important outcomes by enabling functional regulatory annotation of
an expanded set of relevant cell types for a significant portion of sequenced vector insects. It will have a major
long-term impact on our ability to address fundamental questions of vector biology and...

## Key facts

- **NIH application ID:** 10864445
- **Project number:** 1R03AI182642-01
- **Recipient organization:** STATE UNIVERSITY OF NEW YORK AT BUFFALO
- **Principal Investigator:** MARC S HALFON
- **Activity code:** R03
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $79,256
- **Award type:** 1
- **Project period:** 2024-06-13 → 2026-04-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10864445

## Citation

> US National Institutes of Health, RePORTER application 10864445, Generating better-targeted training data for computational enhancer discovery in vector insects (1R03AI182642-01). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10864445. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
