Improved metadata authoring to enhance AI/ML readiness of associated datasets

NIH RePORTER · NIH · R01 · $274,507

Abstract

This proposal is submitted to supplement grant R01 LM013498-01, “The Metadata Powerwash—Integrated tools to make biomedical data FAIR.” The parent grant proposes to study AI methods to standardize the metadata in online datasets to make the corresponding data findable, accessible, interoperable, and reusable (FAIR), and thus “AI-ready.” The goal of the parent grant is to transform the metadata that annotate experimental datasets online into a form that adheres to formal reporting guidelines and that uses terms from standard ontologies and common data elements from NIH repositories.

The research depends on a technology known as CEDAR, which manages a library of metadata templates. These templates correspond to reporting guidelines that define the expected attribute–value pairs in standard metadata descriptions. The Metadata Powerwash uses these CEDAR metadata templates to suggest which elements from standard reporting guidelines might have been intended by the idiosyncratic entries that scientists often use when they author metadata. The CEDAR technology, while widely used and extremely successful, is already 7 years old and in need of modernization. Enhancements to CEDAR will have obvious benefits to the parent grant.

CEDAR uses its library of metadata templates to assist scientists when they author new metadata to describe the datasets that result from their experiments. The system ensures that the new metadata adhere to appropriate standards whenever possible. CEDAR is slated to be included as part of the cloud-based Data Hub for the NIH RADx program, which supports a wide range of studies in the area of diagnostic testing for COVID-19. Unfortunately, CEDAR is not cloud-ready. Thus, if CEDAR is to play an optimal role in enhancing the AI-readiness of NIH RADx data, additional work is necessary.

To advance the role of CEDAR in the creation of AI-ready datasets, we will pursue two aims. (1) We will make CEDAR cloud-native by containerizing all CEDAR microservices, by making these microservices discoverable and observable, and by migrating the entire system to the cloud. (2) We will make CEDAR a highly available system that is easy to maintain and evolve; we will simplify and enhance the system’s architecture, taking advantage of new approaches and components that were not available to us when the system was first designed. As a result, CEDAR will be much more scalable, maintainable, and deployable. The new architecture will advance the application of AI techniques not only to RADx data, but also to a wide range of datasets of importance to the NIH.

Key facts

NIH application ID: 10592638
Project number: 3R01LM013498-02S1
Recipient: STANFORD UNIVERSITY
Principal Investigator: Mark A Musen
Activity code: R01
Funding institute: NIH
Fiscal year: 2022
Award amount: $274,507
Award type: 3
Project period: 2021-05-01 → 2025-01-31