Interpretable and extendable deep learning model for biological sequence analysis and prediction

NIH RePORTER · NIH · R35 · $234,750

Abstract

SUMMARY Single-cell sequencing technologies provide great opportunities for studying biology and medicine, but computational analysis is often the bottleneck in revealing biological insights and defining the cellular heterogeneity underlying the data. Applications of machine learning (ML), especially deep learning, hold great promise for addressing these challenges. While ML studies from various labs, including the PI’s lab, have made significant progress along this line, the involvement of the ML community in single-cell data analysis is limited by the barriers of technological complexity and required biological knowledge. To attract more ML experts into this field, the PI proposes to make large-scale single-cell sequencing data ML-ready and to provide an ML-friendly development environment. Specific aims include:

(1) Collect, process, and manage diverse single-cell sequencing data to make them ML-ready. We will collect single-cell sequencing data from public sources and convert them into formats that are efficient for storage and handling. Using a pipeline to be developed, the data will be processed with multiple options, such as imputation, normalization, and dimension reduction.

(2) Configure the data into benchmarks. We will use the collected data to build benchmarks, gather public benchmarks, and encourage the community to submit their own benchmarks. The data will be divided into training, validation, and test sets in multiple settings, including a minimum viable benchmark to support efficient method development and a comprehensive benchmark for full evaluations. We will develop utilities that evaluate results against a set of assessment measures and generate detailed reports. We will select a set of public tools and run them on the benchmarks as baselines for others to compare against.

(3) Provide an integrated development environment (IDE) to support partial method development. We will build an IDE for developing single-cell sequencing analysis methods, with plug-and-play features at the code level and a web interface, so that ML researchers can contribute and test even minimal new ideas. A report will be provided containing evaluation metrics, usage of computing resources, comparisons with selected public tools, and downstream visualization and interpretation.

The newly formatted data, the benchmarks, and the method development and assessment environment will be available on GitHub and through the in-house single-cell data analysis web portal DeepMAPS. The proposed research is a natural extension of the parent grant (R35-GM126985), which aims to develop deep-learning algorithms, tools, and web resources for the analysis and prediction of biological sequences, including (1) developing general unsupervised representations and making deep-learning models interpretable for understanding biological mechanisms and generating hypotheses; (2) applying deep-learning models to a wide range of bioinformatics problems; and (3) making the data, models, and tools freely accessible to the research community. Thanks to...

Key facts

NIH application ID: 10409152
Project number: 3R35GM126985-04S1
Recipient: UNIVERSITY OF MISSOURI-COLUMBIA
Principal Investigator: DONG XU
Activity code: R35
Funding institute: NIH
Fiscal year: 2021
Award amount: $234,750
Award type: 3
Project period: 2018-05-01 → 2023-04-30