Exploration of DNA functionality using language models

NIH RePORTER · NIH · DP2 · $1,312,212

Abstract

Project Summary: Deoxyribonucleic acid (DNA), which carries the genetic instructions for the growth, functioning, and reproduction of most living organisms, presents a critical avenue for exploring evolutionary history, disease etiology, personalized medical interventions, and beyond. However, our grasp of the role played by DNA sequences, especially non-coding DNA, remains limited. Unraveling this role is challenging because the experiments required to decode it are labor-intensive and resource-demanding. Although computational techniques have emerged, they face obstacles such as insufficient training data to unlock the secrets within DNA sequences. Inspired by recent advances in natural language processing, we see significant prospects for applying a similar approach to study DNA sequences as a form of biological language. Our central concept is to develop an advanced DNA language model. Following the trajectories of its linguistic and protein structural counterparts, our model holds the potential to reinvigorate research on DNA structure and functionality. The model serves as a versatile tool, enabling the exploration of various functions residing within DNA sequences through an innovative multi-task learning architecture. Our initial exploration focuses on unveiling the fundamental mechanisms driving DNA regulation, and the model can be adapted to other DNA functions under the same multi-task learning framework. Additionally, the model facilitates understanding of the functional impacts of non-coding variants, which has profound implications for genetic testing. Our framework could significantly expand the scope of variants that can be reported in genetic testing, thereby enhancing our ability to identify genetic contributions to health and disease and guiding personalized medical interventions.

Key facts

NIH application ID
10909623
Project number
1DP2LM014811-01
Recipient
UNIVERSITY OF FLORIDA
Principal Investigator
Xiao Fan
Activity code
DP2
Funding institute
NIH
Fiscal year
2024
Award amount
$1,312,212
Award type
1
Project period
2024-09-01 → 2027-08-31