# Multimodal Machine-Learning and High Performance Computing Strategies for Big MS Proteomics Data

> **NIH R01** · FLORIDA INTERNATIONAL UNIVERSITY · 2020 · $322,438

## Abstract

Mass spectrometry (MS) data is high-dimensional and is used for large-scale systems biology
proteomics. Current state-of-the-art mass spectrometers can generate thousands of spectra from
a single organism and experiment. This data is processed using database searches
and de novo algorithms with varying degrees of success. The overarching objective of this study is to
develop, test, integrate and evaluate novel image-processing and deep-learning algorithms that will
allow us to deduce and identify reliable peptide sequences in a definitive and quantitative fashion. Our
long-term goal is to improve the identification of MS-based proteomics data using novel and
scalable algorithms. The objective of this proposal is to investigate, design, and implement
machine-learning and deep-learning algorithms for identifying peptides from MS data. Since
deep learning is very good at discovering intricate structures in high-dimensional data, it will be
an ideal solution for exploring dark proteomics data and deducing peptides more accurately. We
predict that the integration of these methods, along with traditional numerical algorithms, will lead
to a multimodal fusion-based approach for an optimized and accurate peptide deduction
system for large-scale MS data. Further, we will design and implement data augmentation,
memory-efficient indexing, and high-performance computing (HPC) strategies to achieve these outcomes
more efficiently and with shorter computation time. This new line of investigation is therefore
significant: it has the potential to advance long-stalled efforts to increase the accuracy,
reliability, and reproducibility of MS data analysis and search tools. The proximate expected
outcome of this work is a novel set of deep-learning and image-processing tools that will allow
much better insight into MS-based proteomics data. The results will have an immediate positive
impact because the proposed research tasks will lay the groundwork for a
new class of algorithms and will provide rapid, high-throughput, sensitive, reproducible, and
reliable tools for MS-based proteomics.

## Key facts

- **NIH application ID:** 9973317
- **Project number:** 1R01GM134384-01A1
- **Recipient organization:** FLORIDA INTERNATIONAL UNIVERSITY
- **Principal Investigator:** Fahad Saeed
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $322,438
- **Award type:** 1
- **Project period:** 2020-06-01 → 2023-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9973317

## Citation

> US National Institutes of Health, RePORTER application 9973317, Multimodal Machine-Learning and High Performance Computing Strategies for Big MS Proteomics Data (1R01GM134384-01A1). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/9973317. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*