# Improved metadata authoring to enhance AI/ML readiness of associated datasets

> **NIH R01** · STANFORD UNIVERSITY · 2022 · $274,507

## Abstract

This proposal is submitted to supplement grant R01 LM013498-01, “The Metadata Powerwash—Integrated
tools to make biomedical data FAIR.” The parent grant proposes to study AI methods to standardize the
metadata in online datasets to make the corresponding data findable, accessible, interoperable, and reusable,
and thus “AI-ready.” The goal of the parent grant is to transform the metadata that annotate experimental
datasets online to a form that adheres to formal reporting guidelines and that uses terms from standard
ontologies and common data elements from NIH repositories. The research depends on technology known as
CEDAR, which manages a library of metadata templates that correspond to reporting guidelines that define the
expected attribute–value pairs in standard metadata descriptions. The Metadata Powerwash uses these
CEDAR metadata templates to suggest what elements from standard reporting guidelines might have been
intended by the idiosyncratic entries that scientists often use when they author metadata. The CEDAR
technology, while widely used and extremely successful, is already 7 years old and in need of modernization.
Enhancements to CEDAR will have obvious benefits to the parent grant.

CEDAR uses its library of metadata templates to assist scientists when they author new metadata to describe
the datasets that result from their experiments. The system ensures that the new metadata are adherent to
appropriate standards whenever possible. CEDAR is slated to be included as part of the cloud-based Data
Hub for the NIH RADx program, which supports a wide range of studies in the area of diagnostic testing for
COVID-19. Unfortunately, CEDAR is not cloud-ready. Thus, if CEDAR is to play an optimal role in enhancing
the AI-readiness of NIH RADx data, additional work is necessary. To advance the role of CEDAR
in the creation of AI-ready datasets, (1) we will make CEDAR cloud-native by containerizing all CEDAR
microservices, by making these microservices discoverable and observable, and by migrating the entire
system to the cloud, and (2) we will make CEDAR a highly available system that is easy to maintain and
evolve; we will simplify and enhance the system’s architecture, taking advantage of new approaches and
components that were not available to us when the system was first designed. As a result, CEDAR will be
much more scalable, maintainable, and deployable. The new architecture will advance the application of AI
techniques not only to RADx data, but also to a wide range of datasets of importance to the NIH.

## Key facts

- **NIH application ID:** 10592638
- **Project number:** 3R01LM013498-02S1
- **Recipient organization:** STANFORD UNIVERSITY
- **Principal Investigator:** Mark A Musen
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $274,507
- **Award type:** 3
- **Project period:** 2021-05-01 → 2025-01-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10592638

## Citation

> US National Institutes of Health, RePORTER application 10592638, Improved metadata authoring to enhance AI/ML readiness of associated datasets (3R01LM013498-02S1). Retrieved via AI Analytics 2026-06-24 from https://api.ai-analytics.org/grant/nih/10592638. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
