# The Metadata Powerwash - Integrated tools to make biomedical data FAIR

> **NIH R01** · STANFORD UNIVERSITY · 2024 · $334,475

## Abstract

 The metadata that describe scientific data are fundamental resources to enable (1) the
discovery and reuse of the data and (2) the reproducibility of the experiments that generated the
data in the first place. Metadata are essential for scientists to understand the associated data
and to reuse them, as well as for information technology to index the data, to make the data
available, and to provide filters for scientists to search for the corresponding datasets.
Currently, the scientific metadata hosted in public repositories suffer from multiple quality issues
that limit scientists’ ability to find and reuse the experimental datasets to which they refer. It can
take many weeks of a scientist’s time to identify a collection of datasets that fulfill specific
criteria when the data are so poorly described—and the majority of the process is necessarily
manual.

We propose to develop an end-to-end solution to standardize biomedical metadata with the
help of ontologies—data structures that define the terms in an application domain and the
relationships among them. There are hundreds of ontologies that provide standard terms for
use in biomedicine, and they are essential resources to make biomedical metadata
interoperable and reusable. Our approach will also build on the technology created by the
Center for Expanded Data Annotation and Retrieval (CEDAR), which offers a library of building
blocks and common data elements for defining computer-based metadata templates based on
community standards.

Our plan involves three specific aims. First, we will develop a method and tool to standardize
the multiple, ad hoc metadata field names that may appear in metadata to represent the same
type of information by replacing those field names with the field names used in standard
metadata templates or, if no appropriate template match is available, with terms from a relevant
ontology. Second, we will develop methods and tools to standardize different types of metadata
field values, for example, categorical values such as drugs or diseases, and numerical values
such as age or sample collection date. Third, we will evaluate the speed, precision, and recall
of our metadata transformation pipeline—built out of the methods and tools to standardize field
names and values—on a large corpus of metadata that we will manually curate based on
existing public metadata. We will also carry out experiments to test the effect of the
standardized metadata when biomedical scientists perform dataset search in the context of their
work.
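
The three aims can be illustrated with a toy sketch: map ad hoc field names to standard ones, standardize the values, then score the result against a hand-curated record. This is not the project's implementation. The lookup tables, function names, and sample record below are all hypothetical, and a real pipeline would presumably draw its standard names and terms from CEDAR templates and ontologies rather than from hard-coded dictionaries.

```python
import re

# Hypothetical lookup tables, invented for illustration only.
STANDARD_FIELD_NAMES = {
    "age": "age",
    "patientage": "age",
    "ageyears": "age",
    "disease": "disease",
    "diagnosis": "disease",
    "diseasestate": "disease",
}
STANDARD_DISEASE_TERMS = {
    "t2d": "type 2 diabetes mellitus",
    "type ii diabetes": "type 2 diabetes mellitus",
}


def normalize_key(name):
    """Lowercase a field name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def standardize_field_name(name):
    """Return the standard field name, or None when no match is known."""
    return STANDARD_FIELD_NAMES.get(normalize_key(name))


def standardize_value(field, value):
    """Standardize a value according to its already standardized field."""
    text = value.strip().lower()
    if field == "age":
        match = re.search(r"\d+", text)  # first run of digits, e.g. "45 years"
        return int(match.group()) if match else None
    if field == "disease":
        return STANDARD_DISEASE_TERMS.get(text, text)
    return value


def standardize_record(record):
    """Standardize the field names and values of one metadata record."""
    clean = {}
    for name, value in record.items():
        field = standardize_field_name(name)
        if field is None:
            clean[name] = value  # leave unrecognized fields untouched
        else:
            clean[field] = standardize_value(field, value)
    return clean


def precision_recall(predicted, gold):
    """Compare predicted (field, value) pairs against a curated gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# A made-up raw record and its hand-curated counterpart.
raw = {"Patient_Age": "45 years", "Disease State": "T2D", "batch": "7"}
clean = standardize_record(raw)
gold = {"age": 45, "disease": "type 2 diabetes mellitus", "batch": 7}
precision, recall = precision_recall(clean.items(), gold.items())
```

In this example the `batch` field has no standard mapping, so its value stays the string `"7"` and fails to match the curated integer, which lowers both precision and recall. Scoring at the level of (field, value) pairs is one possible design choice; the abstract does not specify how the evaluation will be computed.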

## Key facts

- **NIH application ID:** 10764899
- **Project number:** 5R01LM013498-04
- **Recipient organization:** STANFORD UNIVERSITY
- **Principal Investigator:** Mark A Musen
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $334,475
- **Award type:** 5
- **Project period:** 2021-05-01 → 2026-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10764899

## Citation

> US National Institutes of Health, RePORTER application 10764899, The Metadata Powerwash - Integrated tools to make biomedical data FAIR (5R01LM013498-04). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10764899. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*