# Automated methods for standardization and enhancement of metadata in biomedical databases

> **NIH R01** · UNIVERSITY OF WISCONSIN-MADISON · 2024 · $332,393

## Abstract

Biomedical research data sets are increasingly being deposited in public, centralized databases, such as the
Sequence Read Archive (SRA), to which researchers submit sequencing-based data. Large centralized
databases greatly expand opportunities for training powerful machine learning models, as well as for reanalysis
and cross-study meta-analysis of biomedical data. These analyses can be used to answer questions that were
not addressed in the papers first describing the data, including those that could only be answered by
aggregating data from multiple studies. Unfortunately, researchers have not been able to fully capitalize on
databases of biomedical data sets largely because the metadata provided for data sets are often unstructured,
unstandardized, and incomplete. For example, the primary metadata for samples with assays deposited in the
SRA are provided as a list of key-value pairs, with no standardization of the keys or values and no required
fields. Such poor metadata pose challenges for integrating data sets with these databases as well as for
querying for specific data sets of interest.

To fully enable the opportunities offered by large biomedical databases, we propose to develop automated
methods for curating the metadata contained within them. These methods will standardize the metadata of a
database by assigning to each record a set of standardized terms for concepts represented within biomedical
ontologies and will additionally identify the relationship between each concept and record (e.g., a record’s
corresponding biological sample was derived from liver tissue). A complementary set of methods will be
developed to identify missing or unstandardized concepts in metadata. The developed methods will use
machine learning approaches that can be trained with minimal human effort. To achieve high accuracy with
sparse training data, we will take advantage of cutting-edge approaches in deep learning, natural language
processing, and active learning. As a specific application of these general methods, we will use them to
standardize and enhance the metadata contained within the SRA and the Gene Expression Omnibus (GEO)
for the most commonly assayed species using a comprehensive set of ontology concepts and relationships.
The resulting standardized metadata for the SRA and GEO will be made freely available and easily accessible
via a web interface, bulk downloads, and R and Python interface packages. The developed methods, along
with the standardized metadata they produce, will allow biomedical databases to be used to their full potential
in advancing our understanding of fundamental biology and human health.
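
To make the target output concrete, the following is a minimal sketch of what standardizing one record of free-form key-value metadata could look like. It is not the proposed method: the abstract describes machine learning approaches, whereas this sketch swaps in a plain dictionary lookup purely for illustration. The names `TERM_LOOKUP`, `KEY_RELATIONS`, `Annotation`, `normalize`, and `annotate`, and the relation label `derived_from`, are all hypothetical. The two ontology identifiers are UBERON terms for liver and brain.

```python
from dataclasses import dataclass

# Hypothetical lookup from normalized free-text values to ontology terms.
# A real system would learn these mappings instead of hard-coding them.
TERM_LOOKUP = {
    "liver": ("UBERON:0002107", "liver"),
    "brain": ("UBERON:0000955", "brain"),
}

# Hypothetical lookup from normalized free-text keys to the relationship
# between the record and the concept named by the value.
KEY_RELATIONS = {
    "tissue": "derived_from",
    "tissue type": "derived_from",
    "source name": "derived_from",
}


@dataclass(frozen=True)
class Annotation:
    relation: str    # how the record relates to the concept
    term_id: str     # ontology identifier
    term_label: str  # human-readable label of the ontology term


def normalize(text):
    """Lowercase, turn underscores into spaces, and collapse whitespace."""
    return " ".join(text.replace("_", " ").lower().split())


def annotate(record):
    """Return (annotations, unmapped_keys) for one key-value record."""
    annotations = []
    unmapped = []
    for key, value in record.items():
        relation = KEY_RELATIONS.get(normalize(key))
        term = TERM_LOOKUP.get(normalize(value))
        if relation and term:
            annotations.append(Annotation(relation, *term))
        else:
            # Keys that cannot be mapped are reported rather than dropped,
            # mirroring the goal of flagging unstandardized metadata.
            unmapped.append(key)
    return annotations, unmapped


annotations, unmapped = annotate({"Tissue_Type": "Liver", "age": "12 weeks"})
```

Here `annotations` holds one `derived_from` annotation pointing at the liver term, and `unmapped` lists the `age` key, which the toy lookup tables do not cover.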

## Key facts

- **NIH application ID:** 10942272
- **Project number:** 1R01LM014593-01
- **Recipient organization:** UNIVERSITY OF WISCONSIN-MADISON
- **Principal Investigator:** Colin Noel Dewey
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $332,393
- **Award type:** 1
- **Project period:** 2024-08-07 → 2028-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10942272

## Citation

> US National Institutes of Health, RePORTER application 10942272, Automated methods for standardization and enhancement of metadata in biomedical databases (1R01LM014593-01). Retrieved via AI Analytics 2026-06-13 from https://api.ai-analytics.org/grant/nih/10942272. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
