Multi-view self-supervised deep learning for biological sequences and beyond

NIH RePORTER · NIH · R35 · $391,250

Abstract

The breadth and depth of deep learning (DL) in solving fundamental biological problems have been demonstrated. DL-based approaches, such as AlphaFold2 for 3D protein structure prediction, have become widely accepted by the biology community. The Xu lab has been at the forefront of developing novel DL algorithms, software, and information systems for diverse biological and medical problems. During the current project period, the Xu lab has made excellent progress in addressing some of the urgent challenges and needs for developing DL methods in biological sequence analyses and predictions, as well as other bioinformatics problems. This R35 project has produced 31 papers covering research topics ranging from protein sequence-based predictions to drug design, molecular dynamics simulation, and single-cell data analysis. It has also provided more than ten open-source tools and three major web-based resources to the community. The rapid development of new DL techniques and the Xu lab’s accumulating expertise in this field bring new opportunities for shaping DL for molecular biology. The supervised DL methods currently in wide use in biomedical research often lack sufficient data with clean and accurate labels for training and may not generalize well. Emerging self-supervised learning (SSL) approaches, which aim to learn informative representations by exposing relationships between different data perspectives without human annotations, are becoming a new trend. These different data perspectives are broadly called multi-view. Multi-view SSL techniques allow us to generate joint or coordinated representations for single-modal and multimodal data with stronger generalizability, better robustness, and less bias. Though SSL has demonstrated great success in other fields, it has only been minimally explored in biology.
This renewal project will develop a multi-view SSL framework that can handle both single-view and multi-view data and can perform single or multiple tasks. It will tackle key challenges and bottlenecks in applying SSL to biological studies, such as selecting effective views and data augmentations, fusing multimodal data or data from heterogeneous sources, and integrating biological constraints into SSL models. We will focus on designing a biology-informed system, enhancing generalizability and robustness, and making the results biologically interpretable and confidence-assessable. The Xu lab will apply and refine the framework in multiple mainstream biology applications, including anti-CRISPR protein prediction by exploring various data augmentation methods for protein sequences, ion and small ligand binding prediction using complementary views of protein sequences and structures, and single-cell data analyses across different conditions. The framework will also be tested for broad applications in sequence-based studies and beyond, such as alignment-free methods for constructing phylogenetic trees ...

Key facts

NIH application ID: 10895278
Project number: 5R35GM126985-07
Recipient: UNIVERSITY OF MISSOURI-COLUMBIA
Principal Investigator: DONG XU
Activity code: R35
Funding institute: NIH
Fiscal year: 2024
Award amount: $391,250
Award type: 5
Project period: 2018-05-01 → 2028-07-31