The Metadata Powerwash - Integrated tools to make biomedical data FAIR

NIH RePORTER · NIH · R01 · $334,475

Abstract

Project Summary

The metadata that describe scientific data are fundamental resources that enable (1) the discovery and reuse of the data and (2) the reproducibility of the experiments that generated the data in the first place. Metadata are essential for scientists to understand the associated data and to reuse them, as well as for information technology to index the data, to make the data available, and to provide filters for scientists to search for the corresponding datasets. Currently, the scientific metadata hosted in public repositories suffer from multiple quality issues that limit scientists’ ability to find and reuse the experimental datasets to which they refer. When the data are so poorly described, it can take many weeks of a scientist’s time to identify a collection of datasets that fulfill specific criteria, and most of the process is necessarily manual.

We propose to develop an end-to-end solution to standardize biomedical metadata with the help of ontologies, which are data structures that define the terms in an application domain and the relationships among them. There are hundreds of ontologies that provide standard terms for use in biomedicine, and they are essential resources to make biomedical metadata interoperable and reusable. Our approach will also build on the technology created by the Center for Expanded Data Annotation and Retrieval (CEDAR), which offers a library of building blocks and common data elements for defining computer-based metadata templates based on community standards.

Our plan involves three specific aims. First, we will develop a method and tool to standardize the multiple, ad hoc field names that may appear in metadata to represent the same type of information, by replacing those field names with the field names used in standard metadata templates or, if no appropriate template match is available, with terms from a relevant ontology. Second, we will develop methods and tools to standardize different types of metadata field values, for example, categorical values such as drugs or diseases, and numerical values such as age or sample collection date. Third, we will evaluate the speed, precision, and recall of our metadata transformation pipeline, built from the methods and tools to standardize field names and values, on a large corpus of metadata that we will manually curate based on existing public metadata. We will also carry out experiments to test the effect of the standardized metadata when biomedical scientists perform dataset search in the context of their work.
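The three aims can be illustrated with a small sketch. It is not the project's pipeline and does not use CEDAR's or any ontology service's API. The synonym table, the disease term map, and every function name below are hypothetical stand-ins for what a real system would obtain from metadata templates and ontologies. The sketch maps ad hoc field names to standard ones, normalizes a few value types, and scores proposed mappings against a manually curated reference using precision and recall.

```python
import re
from datetime import datetime

# Hypothetical stand-in for a metadata template: standard name -> ad hoc variants.
FIELD_SYNONYMS = {
    "age": {"age", "age_years", "patient age"},
    "disease": {"disease", "diagnosis", "disease state"},
    "sample_collection_date": {"collection date", "date collected"},
}

# Hypothetical stand-in for an ontology lookup of categorical values.
DISEASE_TERMS = {
    "t2d": "type 2 diabetes mellitus",
    "type ii diabetes": "type 2 diabetes mellitus",
}

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d")


def _norm(text):
    """Lowercase and collapse whitespace, underscores, and hyphens."""
    return re.sub(r"[\s_\-]+", " ", text.strip().lower())


def standardize_field_name(name):
    """Return the standard field name, or None if no variant matches."""
    key = _norm(name)
    for standard, variants in FIELD_SYNONYMS.items():
        if key in {_norm(v) for v in variants}:
            return standard
    return None


def standardize_value(field, value):
    """Normalize a raw string value according to its standard field."""
    if field == "age":
        match = re.search(r"\d+(\.\d+)?", value)
        return float(match.group()) if match else None
    if field == "sample_collection_date":
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value.strip(), fmt).date().isoformat()
            except ValueError:
                continue
        return None
    if field == "disease":
        return DISEASE_TERMS.get(_norm(value))
    return value


def standardize_record(record):
    """Keep only fields with a known standard name, with normalized values."""
    result = {}
    for name, value in record.items():
        field = standardize_field_name(name)
        if field is not None:
            result[field] = standardize_value(field, value)
    return result


def precision_recall(predicted, gold):
    """Score proposed field-name mappings against curated ones.

    Both arguments map a raw field name to a standard name, or to None
    when no mapping is proposed (predicted) or expected (gold).
    """
    proposed = {k: v for k, v in predicted.items() if v is not None}
    expected = {k: v for k, v in gold.items() if v is not None}
    correct = sum(1 for k, v in proposed.items() if expected.get(k) == v)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


raw = {"Age_Years": "45 yrs", "Diagnosis": "T2D",
       "Date Collected": "03/15/2020", "lab notes": "n/a"}
clean = standardize_record(raw)
```

In this toy setting, unmatched fields such as "lab notes" are simply dropped. A real pipeline would more plausibly fall back to an ontology term search, as the first aim describes, and would need far richer matching than exact lookup in a synonym table.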

Key facts

NIH application ID: 10764899
Project number: 5R01LM013498-04
Recipient: STANFORD UNIVERSITY
Principal Investigator: Mark A Musen
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $334,475
Award type: 5
Project period: 2021-05-01 → 2026-01-31