# Exploration of DNA functionality using language models

> **NIH DP2** · UNIVERSITY OF FLORIDA · 2024 · $1,312,212

## Abstract

Deoxyribonucleic acid (DNA), carrying genetic instructions for the growth, functioning, and
reproduction of most living organisms, presents a critical avenue for exploring evolutionary
history, disease etiology, personalized medical interventions, and beyond. However, our grasp of
the role played by DNA sequences, especially non-coding DNA, remains limited. Unraveling this
mystery proves challenging due to the labor-intensive and resource-demanding nature of
experiments required for its decoding. Although computational techniques have arisen, they
grapple with obstacles such as insufficient training data to unlock the secrets within DNA
sequences. Inspired by recent advancements in natural language processing, we recognize
significant prospects for employing a similar approach to study DNA sequences as a form of
biological language. Our central concept revolves around developing an advanced DNA
language model. Aligning with the trajectories of its linguistic and protein structural counterparts,
our model holds the potential to reinvigorate the research of DNA structure and functionality.
This model serves as a versatile tool, enabling the exploration of various functions residing
within DNA sequences using an innovative multi-task learning architecture. Our initial
exploration focuses on unveiling the fundamental mechanisms driving DNA regulation. The
model is adaptable to various DNA functions under the multi-task learning framework.
Additionally, the model facilitates the understanding of the functional impacts of non-coding
variants, which has profound implications for genetic testing. Our innovative framework could
significantly expand the scope of variants that can be reported in genetic testing, thereby
enhancing our ability to identify genetic contributions to health and disease and guiding
personalized medical interventions.

## Key facts

- **NIH application ID:** 10909623
- **Project number:** 1DP2LM014811-01
- **Recipient organization:** UNIVERSITY OF FLORIDA
- **Principal Investigator:** Xiao Fan
- **Activity code:** DP2
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $1,312,212
- **Award type:** 1
- **Project period:** 2024-09-01 → 2027-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10909623

## Citation

> US National Institutes of Health, RePORTER application 10909623, Exploration of DNA functionality using language models (1DP2LM014811-01). Retrieved via AI Analytics 2026-05-21 from https://api.ai-analytics.org/grant/nih/10909623. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
