# Interpretable and extendable deep learning model for biological sequence analysis and prediction

> **NIH R35** · UNIVERSITY OF MISSOURI-COLUMBIA · 2021 · $234,750

## Abstract

Single-cell sequencing technologies provide great opportunities for studying biology and medicine, but
computational analysis is often the bottleneck in revealing biological insights and defining the cellular heterogeneity
underlying the data. Applications of machine learning (ML), especially deep learning, hold great promise
for addressing these challenges. While ML studies from various labs, including the PI’s lab, have made significant
progress along this line, the involvement of the ML community in single-cell data analysis is limited due to the
barriers of technology complexity and biology knowledge. To attract more ML experts into this field, the PI
proposes to make large-scale single-cell sequencing data ML-ready and provide an ML-friendly development
environment. Specific aims include: (1) Collect, process, and manage diverse single-cell sequencing data
to make them ML-ready. We will collect single-cell sequencing data from public sources and convert them
into formats efficient for storage and handling. The data will be processed with multiple options, such as
imputation, normalization, and dimension reduction using a pipeline to be developed. (2) Configure the data
into benchmarks. We will use the collected data to build benchmarks, gather public benchmarks, and
encourage the community to submit their benchmarks. The data will be divided into training, validation, and
test sets in multiple settings, including a minimum viable benchmark to assist efficient method development
and a comprehensive benchmark for full evaluations. We will develop utilities to evaluate results based on a
set of assessment measures and generate detailed reports. We will select a set of public tools and run them on
the benchmarks as baselines for comparison. (3) Provide an integrated development
environment (IDE) to support partial method development. We will build an IDE for single-cell sequencing
analysis method development with plug-and-play features at the code level and a web interface for ML
researchers to contribute and test minimal new ideas. A report will be provided containing evaluation
metrics and usage of computer resources, comparisons with some public tools, and downstream visualization
and interpretation. The newly formatted data, the benchmarks, and the method development and assessment
environment will be available on GitHub and the in-house single-cell data analysis web portal DeepMAPS. The
proposed research is a natural extension of the parent grant (R35-GM126985), which aims to develop
deep-learning algorithms, tools, and web resources for analyses and predictions of biological sequences, including (1)
developing general unsupervised representations and making deep-learning models interpretable for
understanding biological mechanisms and generating hypotheses; (2) applying deep-learning models to a wide
range of bioinformatics problems; and (3) making the data, models, and tools freely accessible to the research
community. Thanks to...
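
As an illustration of the kind of processing and data splitting described in aims (1) and (2), the following is a minimal Python sketch, not the project's actual pipeline. The function names `normalize_counts` and `split_cells`, the target of 10,000 counts per cell, the split fractions, and the toy count matrix are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random


def normalize_counts(counts, target_sum=10_000):
    """Scale each cell (row) to target_sum total counts, then apply log1p.

    `counts` is a list of rows, one per cell, with one value per gene.
    Cells with zero total counts are left as all zeros.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in cell])
    return normalized


def split_cells(n_cells, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle cell indices and partition them into train/validation/test.

    The test set receives whatever remains after the first two partitions,
    so every cell lands in exactly one split.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)  # seeded for reproducible splits
    n_train = int(n_cells * fractions[0])
    n_val = int(n_cells * fractions[1])
    return {
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }


# Toy count matrix (hypothetical): 4 cells x 3 genes.
counts = [[0, 3, 1], [5, 0, 5], [2, 2, 0], [0, 0, 4]]
normalized = normalize_counts(counts)
splits = split_cells(len(counts), fractions=(0.5, 0.25, 0.25))
```

A fixed seed keeps the split reproducible, which matters when a benchmark is meant to be shared so that different methods are evaluated on the same held-out cells.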

## Key facts

- **NIH application ID:** 10409152
- **Project number:** 3R35GM126985-04S1
- **Recipient organization:** UNIVERSITY OF MISSOURI-COLUMBIA
- **Principal Investigator:** DONG XU
- **Activity code:** R35
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $234,750
- **Award type:** 3
- **Project period:** 2018-05-01 → 2023-04-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10409152

## Citation

> US National Institutes of Health, RePORTER application 10409152, Interpretable and extendable deep learning model for biological sequence analysis and prediction (3R35GM126985-04S1). Retrieved via AI Analytics 2026-05-21 from https://api.ai-analytics.org/grant/nih/10409152. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
