Methods For Evolutionary Genomics Analysis

NIH RePORTER · NIH · R35 · $231,674

Abstract

The parent R35 research program aims to develop innovative methods and tools for the comparative analysis of molecular sequences. The focus is on creating machine-learning methods for big data analytics, using them to gain biological insights, and comparing them with traditional model-based methods in molecular evolution and phylogenetics. A key development in this program is the Evolutionary Sparse Learning (ESL) framework, designed to enhance molecular evolutionary analyses. Although ESL has been benchmarked against classical methods using high-performance computing (HPC) resources, benchmarking against advanced deep learning (DL) approaches remains infeasible because it requires substantial computational power. To address this, we request a Graphics Processing Unit (GPU) cluster to enable the DL analyses essential for advancing our research. Two major example projects highlight the need for this system.

The first project focuses on discovering fragile clades and causal sequences in phylogenomics. We have developed metrics for gene-species sequence concordance and clade probability using ESL models, validated across many phylogenomic datasets. Benchmarking these ESL methods against DL approaches, such as MSA Transformer, is crucial. MSA Transformer captures phylogenetic relationships using multiple sequence alignments (MSAs) but requires refinement for orthologous protein sets, which demands a powerful GPU system.

The second project aims to uncover molecular convergences that parallel organismal convergent evolution. Using ESL, we have built genetic models to understand the independent origins of traits such as C4 photosynthesis in grasses and echolocation in mammals. Benchmarking revealed that current methods, including ESL, are limited in detecting convergences that involve different residues at different sites. Therefore, we are developing ESL approaches that leverage DL-generated protein embeddings to infer non-identical sequence convergence.

Fine-tuning general DL models for orthologous sequences requires a dedicated GPU cluster, as existing resources are inadequate for the extensive analyses needed. The requested GPU cluster is essential for refining these DL models and conducting comprehensive analyses, enhancing the impact and scope of our parent grant. Our experienced team and institutional support ensure effective use and maintenance of the equipment, promoting continued advancements in molecular evolutionary analysis.

Key facts

NIH application ID: 11099368
Project number: 3R35GM139540-04S1
Recipient: TEMPLE UNIV OF THE COMMONWEALTH
Principal Investigator: Sudhir Kumar
Activity code: R35
Funding institute: NIH
Fiscal year: 2024
Award amount: $231,674
Award type: 3
Project period: 2021-02-01 → 2026-01-31