Automated methods for standardization and enhancement of metadata in biomedical databases

NIH RePORTER · NIH · R01 · $332,393

Abstract

Biomedical research data sets are increasingly being deposited in public, centralized databases, such as the Sequence Read Archive (SRA), to which researchers submit sequencing-based data. Large centralized databases create opportunities for training powerful machine learning models, as well as for reanalysis and cross-study meta-analysis of biomedical data. These analyses can answer questions that were not addressed in the papers first describing the data, including questions that can only be answered by aggregating data from multiple studies. Unfortunately, researchers have not been able to fully capitalize on these databases, largely because the metadata provided for data sets are often unstructured, unstandardized, and incomplete. For example, the primary metadata for samples with assays deposited in the SRA are provided as a list of key-value pairs, with no standardization of the keys or values and no required fields. Such poor metadata make it difficult both to integrate data sets with these databases and to query for specific data sets of interest. To fully enable the opportunities offered by large biomedical databases, we propose to develop automated methods for curating the metadata contained within them. These methods will standardize the metadata of a database by assigning to each record a set of standardized terms for concepts represented within biomedical ontologies, and will additionally identify the relationship between each concept and the record (e.g., that a record’s corresponding biological sample was derived from liver tissue). A complementary set of methods will be developed to identify missing or unstandardized concepts in metadata. The developed methods will use machine learning approaches that can be trained with minimal human effort.
To achieve high accuracy with sparse training data, we will take advantage of cutting-edge approaches in deep learning, natural language processing, and active learning. As a specific application of these general methods, we will use them to standardize and enhance the metadata contained within the SRA and the Gene Expression Omnibus (GEO) for the most commonly assayed species using a comprehensive set of ontology concepts and relationships. The resulting standardized metadata for the SRA and GEO will be made freely available and easily accessible via a web interface, bulk downloads, and R and Python interface packages. The developed methods, along with the standardized metadata they produce, will allow biomedical databases to be used to their full potential in advancing our understanding of fundamental biology and human health.
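
To make the standardization task concrete, the following is a minimal sketch of what assigning ontology terms and relationships to a key-value record could look like. It is not the project's method: the abstract describes machine learning approaches, whereas this sketch substitutes hand-written lookup tables. The names `KEY_TO_RELATION`, `VALUE_TO_TERM`, `Annotation`, and `standardize` are hypothetical, and the ontology identifiers are included for illustration and should be checked against the ontology before use.

```python
from dataclasses import dataclass

# Hypothetical lookup tables, standing in for the learned models
# described in the abstract. Keys are normalized free-text strings.
KEY_TO_RELATION = {
    "tissue": "derived_from_tissue",
    "tissue type": "derived_from_tissue",
    "organ": "derived_from_tissue",
}
VALUE_TO_TERM = {
    # Illustrative ontology identifiers (assumed, verify before use).
    "liver": ("UBERON:0002107", "liver"),
    "hepatic tissue": ("UBERON:0002107", "liver"),
    "blood": ("UBERON:0000178", "blood"),
}


@dataclass(frozen=True)
class Annotation:
    """One standardized concept and its relationship to the record."""
    relation: str
    term_id: str
    term_label: str


def normalize(text: str) -> str:
    """Lowercase, treat underscores as spaces, and collapse whitespace."""
    return " ".join(text.lower().replace("_", " ").split())


def standardize(record: dict[str, str]) -> tuple[list[Annotation], list[str]]:
    """Map free-text key-value pairs to annotations.

    Returns the annotations found and the original keys that could not
    be mapped, which a curator or model would still need to handle.
    """
    annotations: list[Annotation] = []
    unmapped: list[str] = []
    for key, value in record.items():
        relation = KEY_TO_RELATION.get(normalize(key))
        term = VALUE_TO_TERM.get(normalize(value))
        if relation is not None and term is not None:
            annotations.append(Annotation(relation, term[0], term[1]))
        else:
            unmapped.append(key)
    return annotations, unmapped


# Two differently worded records that describe the same tissue.
record_a = {"Tissue_Type": "Liver", "age": "6 weeks"}
record_b = {"organ": "hepatic tissue"}
annotations_a, unmapped_a = standardize(record_a)
annotations_b, unmapped_b = standardize(record_b)
```

In this sketch both records resolve to the same annotation despite using different keys and values, which is the property that makes cross-study querying possible. The unmapped keys correspond loosely to the missing or unstandardized concepts that the complementary methods are meant to identify.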

Key facts

NIH application ID: 10942272
Project number: 1R01LM014593-01
Recipient: UNIVERSITY OF WISCONSIN-MADISON
Principal Investigator: Colin Noel Dewey
Activity code: R01
Funding institute: NIH
Fiscal year: 2024
Award amount: $332,393
Award type: 1
Project period: 2024-08-07 → 2028-05-31