# A Knowledge Provider for Scruffy Sources of Metadata in Translational Medicine

> **NIH OT2** · STANFORD UNIVERSITY · 2020 · $55,984

## Abstract

An essential task for the Biomedical Data Translator is to identify scientific experiments that
have been performed or that are ongoing, and to enable integration of knowledge of the
experimental methods, the results, and—when available—the conclusions with other knowledge
sources. Such capabilities will enable queries such as: (1) Has anyone ever performed an
experiment using methods like these? (2) Has anyone performed a study where the data may
support a particular conclusion? (3) Are there any clinical trials for a particular condition whose
patient population is a good match for a patient whom I now need to treat? (4) What best
practices are suggested by the results of current clinical trials for a particular condition?
Sometimes such queries can be addressed through an analysis of the scientific literature. More
often, however, the published literature does not provide the methodological details needed to
address such questions—even if NLP techniques were good enough to find the answers.
Publications also provide only summary statistics of the experimental results. To address the
kinds of queries that are of most interest to the Translator, it is necessary to access the actual
experimental data online, starting with the metadata that are intended to provide descriptions of
the datasets and of the experiments that led to the collection of the data in the first place.

The problem for the Translator project is that the metadata that describe most online
experimental data sources are difficult for computers to find and to process. Our laboratory’s
analysis of the NCBI BioSample metadata repository, for example, shows that scientists largely
avoid using standard data dictionaries entirely, and—partly as a result—they are extremely
sloppy when they provide metadata values [3]. (A case in point: Some 76% of the metadata
values in BioSample that are intended to be Boolean are neither true nor false.) Despite all the
discussion in the past few years about making online datasets Findable, Accessible,
Interoperable, and Re-usable (FAIR) [14], most online datasets are not close to FAIR.

Our laboratory is developing technology that can rectify errors in online metadata. Like a
spell-checker for metadata, our approach will attempt to identify the intentions of metadata
authors, to correct typos, and to convert free-text strings to ontology terms whenever possible
[6]. Our goal is to provide a service that will transform the scruffy metadata that pervade online
descriptions of biomedical experiments into a form that will allow automated discovery,
integration, and secondary analysis of research results in ways that are simply not possible at
present. We anticipate that the Translator will call on our service to find experimental datasets
and their accompanying metadata, to perform standard analyses of such datasets, and to
integrate descriptions of experiments into the evolving knowledge graph.
We will evaluate the performance of our Knowledge Provider by...
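
The "spell-checker for metadata" idea described in the abstract can be illustrated with a minimal sketch. This is not the project's implementation: the vocabulary, the placeholder term IDs (`EX:0001` and so on), the accepted Boolean spellings, and the function names `normalize_boolean` and `suggest_term` are all assumptions made for illustration, and fuzzy string matching from Python's standard `difflib` module stands in for whatever matching technique the laboratory actually uses.

```python
import difflib

# Hypothetical controlled vocabulary mapping labels to term IDs.
# The IDs are placeholders, not identifiers from a real ontology.
VOCABULARY = {
    "liver": "EX:0001",
    "kidney": "EX:0002",
    "lung": "EX:0003",
}

# Assumed spellings that authors might use for Boolean fields.
TRUE_STRINGS = {"true", "t", "yes", "y", "1"}
FALSE_STRINGS = {"false", "f", "no", "n", "0"}


def normalize_boolean(raw):
    """Return True or False for a recognizable Boolean string, else None."""
    value = raw.strip().lower()
    if value in TRUE_STRINGS:
        return True
    if value in FALSE_STRINGS:
        return False
    return None  # value cannot be interpreted as a Boolean


def suggest_term(raw, cutoff=0.8):
    """Map a free-text value to (label, term_id), or None if no label is close enough."""
    value = raw.strip().lower()
    matches = difflib.get_close_matches(value, list(VOCABULARY), n=1, cutoff=cutoff)
    if not matches:
        return None
    label = matches[0]
    return label, VOCABULARY[label]
```

In this sketch, a misspelled value such as `"Kidny"` resolves to the `kidney` entry, while a value with no close label (for example `"heart"` against this tiny vocabulary) yields `None` rather than a guess. The `cutoff` parameter controls how aggressive the correction is; a real service would need to tune it and would match against full ontologies rather than a hand-written dictionary.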

## Key facts

- **NIH application ID:** 10057243
- **Project number:** 1OT2TR003453-01
- **Recipient organization:** STANFORD UNIVERSITY
- **Principal Investigator:** Mark A Musen
- **Activity code:** OT2
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $55,984
- **Award type:** 1
- **Project period:** 2020-01-23 → 2020-04-07

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10057243

## Citation

> US National Institutes of Health, RePORTER application 10057243, A Knowledge Provider for Scruffy Sources of Metadata in Translational Medicine (1OT2TR003453-01). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10057243. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
