# A machine-learning platform to illuminate the chemical dark matter in mass spectrometry-based metabolomics

> **NIH DP5** · PRINCETON UNIVERSITY · 2024 · $391,852

## Abstract

The human body contains thousands of small molecules, and is exposed to thousands more during daily life.
This complex chemical ecosystem reflects both the endogenous metabolism of human cells and
xenobiotic exposures from our diets, our gut flora, and our natural and built environments. At present, however,
the vast majority of these small molecules remain unknown. Remarkably, this gap is not due to a lack of
appropriate experimental technology: mass spectrometry-based metabolomics routinely detects thousands of
distinct chemical signals in any biological sample. However, only a small fraction of these signals are routinely
identified. The remaining profusion of unidentified chemical entities has been dubbed the “dark matter” of the
metabolome. Computational tools to shed light on this chemical dark matter could transform our understanding
of disease pathobiology, open new avenues for personalized medicine, and increase the scope and efficiency
of any metabolomic study. At the same time, true chemical dark matter must be differentiated from the variety
of technical artefacts, contaminants, and redundant forms of the same biomolecules that are also detected by
mass spectrometry. This project proposes to establish a suite of computational tools that will dramatically
advance our ability to interpret mass spectrometry-based metabolomic datasets, and thereby begin to unlock
the dark metabolome. These tools will apply emerging techniques from the field of natural language
processing, including the same large language model (LLM) architectures that power tools like ChatGPT, to
address two of the most important unmet needs in small molecule mass spectrometry. In Aim 1, we will
develop DecipherMS, a computational tool for de novo annotation of both known and unknown chemical
structures from MS/MS spectra. Despite decades of work in computational mass spectrometry, de novo
annotation of unknown molecules remains a critical gap, with virtually all existing tools designed to search in a
database of known structures. DecipherMS will overcome this gap by using language models to decode
unknown chemical structures directly from MS/MS spectra, using a novel data augmentation strategy to learn
effectively from limited training data. In Aim 2, we will develop FoundationMS, a foundation model for mass
spectrometry-based metabolomics. FoundationMS will standardize the data preprocessing workflows
required to identify which mass spectrometric signals should be brought forward for annotation in the first place.
It will do so by learning from a repository-scale corpus of metabolomic data in a self-supervised
manner. The resulting model will be fine-tuned to perform common preprocessing tasks including peak picking,
retention time alignment, adduct removal, and chemical formula assignment. Both DecipherMS and
FoundationMS will be ri...

## Key facts

- **NIH application ID:** 10910517
- **Project number:** 1DP5OD036960-01
- **Recipient organization:** PRINCETON UNIVERSITY
- **Principal Investigator:** Michael Alexander Skinnider
- **Activity code:** DP5
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $391,852
- **Award type:** 1
- **Project period:** 2024-09-19 → 2029-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10910517

## Citation

> US National Institutes of Health, RePORTER application 10910517, A machine-learning platform to illuminate the chemical dark matter in mass spectrometry-based metabolomics (1DP5OD036960-01). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10910517. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
