# Using machine learning techniques to characterize the Metabolomics Workbench Dataset

> **NIH R03** · TUFTS UNIVERSITY MEDFORD · 2020 · $263,120

## Abstract

Mass spectrometry in combination with chromatography provides a powerful approach to characterize
small molecules produced in cells, tissues and other biological systems. In essence, measured metabolites
provide a functional readout of cellular state, allowing novel biological studies that advance our understanding
of health and disease. Currently, the main bottleneck in metabolomics is determining the chemical identities
associated with the spectral signatures of measured masses. Despite the growth of spectral databases and
advances in annotation tools that recommend the chemical structure that best explains each signature, the large
majority of measured masses cannot be assigned a chemical identity. There is now consensus that gleaning
partial information regarding the measured spectra in terms of chemical substructure or chemical classification
can inform biological studies. This consensus is reflected in the newly updated reporting standards for metabolite
annotation as proposed by the Metabolite Identification Task Group of the Metabolomics Society. As we show
in our Preliminary Results, spectral characterization results in “features” that can enhance performance in
machine-learning tasks such as annotation.

This work aims to enhance the use and value of the metabolomics dataset in Metabolomics Workbench
by: (1) developing machine-learning tools trained on this dataset to characterize unknown spectra, and (2) adding
characterization information to the Metabolomics Workbench dataset. In Aim 1, we identify spectral patterns
(motifs) that can represent chemically meaningful groupings of peaks within the spectra (e.g., peaks associated
with aromatic substructures, loss of a substructure fragment, etc.). We utilize neural topic models that use
variational inference to identify such motifs. We expect such models to offer computational speedups and to
identify more chemically coherent motifs when compared to earlier implementations of topic modeling. We
generate motifs across all spectra in the Metabolomics Workbench and provide annotations for each spectrum.
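
To make the "spectra as documents" idea behind Aim 1 concrete, the following is a minimal sketch of how one MS/MS spectrum could be turned into a bag of fragment and neutral-loss "words", the usual input to a topic model. It is not the project's pipeline: the function names, the two-decimal binning, and the toy peak values are all assumptions made for illustration, and the neural topic model itself is not shown.

```python
from collections import Counter


def spectrum_to_words(precursor_mz, peaks, decimals=2, min_intensity=0.0):
    """Convert one spectrum into a bag of 'words' (hypothetical helper).

    peaks is a list of (mz, intensity) pairs. Each retained peak yields a
    fragment word and, if it lies below the precursor, a neutral-loss word.
    The binning resolution (decimals) is an assumed parameter.
    """
    words = Counter()
    for mz, intensity in peaks:
        if intensity <= min_intensity:
            continue
        words[f"frag_{mz:.{decimals}f}"] += 1
        loss = precursor_mz - mz
        if loss > 0:
            words[f"loss_{loss:.{decimals}f}"] += 1
    return words


def build_count_matrix(docs):
    """Stack word bags into a documents-by-vocabulary count matrix."""
    vocab = sorted({word for doc in docs for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for row, doc in zip(matrix, docs):
        for word, count in doc.items():
            row[index[word]] = count
    return vocab, matrix


# Toy spectrum with made-up values, not taken from Metabolomics Workbench.
docs = [spectrum_to_words(180.06, [(77.04, 100.0), (162.05, 50.0)])]
vocab, matrix = build_count_matrix(docs)
```

A topic model fitted to such a matrix would group co-occurring fragment and loss words into candidate motifs; whether a given motif is chemically meaningful still has to be checked by inspection.
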

In Aim 2, we map spectral signatures to chemical ontology classes. As ontologies are hierarchical and
as a molecule can be associated with multiple classes at different hierarchical levels of an ontology, we cast this
mapping problem as a hierarchical multi-label classification problem and use neural networks to implement such
a classifier. The classifier will be trained using the Metabolomics Workbench dataset. Learned motifs from Aim
1 will be used as additional input features to improve classification. We expect that the developed classifier can
be used by others to elucidate measurements of unidentified molecules with chemical ontology classes, or to
generate ontology terms that can be used as features in downstream machine-learning tasks.
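
The hierarchical multi-label framing in Aim 2 can be illustrated with a small sketch. It shows two common ingredients of such classifiers: expanding a molecule's labels to include all ancestor classes when building the training target, and capping each predicted score by its ancestors' scores so that predictions respect the hierarchy. The four-class hierarchy, the function names, and the scores are toy assumptions, and the score-capping step is one generic post-processing choice rather than the method the project uses.

```python
# Toy parent map: each class points to its parent (None marks the root).
# This hierarchy is illustrative only, not the ontology used in the project.
PARENT = {
    "Organic compounds": None,
    "Benzenoids": "Organic compounds",
    "Phenols": "Benzenoids",
    "Lipids and lipid-like molecules": "Organic compounds",
}
CLASSES = list(PARENT)


def ancestors_and_self(label):
    """Return the label followed by all of its ancestors up to the root."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path


def multi_hot(labels):
    """Build a multi-hot target that also switches on every ancestor class."""
    active = set()
    for label in labels:
        active.update(ancestors_and_self(label))
    return [1 if c in active else 0 for c in CLASSES]


def enforce_hierarchy(scores):
    """Cap each class score by the lowest score along its path to the root."""
    return {c: min(scores[a] for a in ancestors_and_self(c)) for c in CLASSES}


target = multi_hot(["Phenols"])
raw_scores = {
    "Organic compounds": 0.9,
    "Benzenoids": 0.4,
    "Phenols": 0.7,
    "Lipids and lipid-like molecules": 0.2,
}
consistent = enforce_hierarchy(raw_scores)
```

Here a molecule labeled only as a phenol is also marked as a benzenoid and an organic compound in `target`, and the capped scores never rank a child class above its parent.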

## Key facts

- **NIH application ID:** 10111982
- **Project number:** 1R03OD030601-01
- **Recipient organization:** TUFTS UNIVERSITY MEDFORD
- **Principal Investigator:** Soha Hassoun
- **Activity code:** R03
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $263,120
- **Award type:** 1
- **Project period:** 2020-09-15 → 2022-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10111982

## Citation

> US National Institutes of Health, RePORTER application 10111982, Using machine learning techniques to characterize the Metabolomics Workbench Dataset (1R03OD030601-01). Retrieved via AI Analytics 2026-05-25 from https://api.ai-analytics.org/grant/nih/10111982. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
