A Knowledge Provider for Scruffy Sources of Metadata in Translational Medicine

NIH RePORTER · NIH · OT2 · $55,984

Abstract

An essential task for the Biomedical Data Translator is to identify scientific experiments that have been performed or that are ongoing, and to enable integration of knowledge of the experimental methods, the results, and (when available) the conclusions with other knowledge sources. Such capabilities will enable queries such as: (1) Has anyone ever performed an experiment using methods like these? (2) Has anyone performed a study where the data may support a particular conclusion? (3) Are there any clinical trials for a particular condition whose patient population is a good match for a patient whom I now need to treat? (4) What best practices are suggested by the results of current clinical trials for a particular condition? Sometimes such queries can be addressed through an analysis of the scientific literature. More often, however, the published literature does not provide the methodological details needed to address such questions, even if NLP techniques were good enough to find the answers. Publications also provide only summary statistics of the experimental results. To address the kinds of queries that are of most interest to the Translator, it is necessary to access the actual experimental data online, starting with the metadata that are intended to provide descriptions of the datasets and of the experiments that led to the collection of the data in the first place. The problem for the Translator project is that the metadata that describe most online experimental data sources are difficult for computers to find and to process. Our laboratory’s analysis of the NCBI BioSample metadata repository, for example, shows that scientists largely avoid using standard data dictionaries and, partly as a result, are extremely sloppy when they provide metadata values [3]. (A case in point: some 76% of the metadata values in BioSample that are intended to be Boolean are neither true nor false.)
Despite all the discussion in the past few years about making online datasets Findable, Accessible, Interoperable, and Reusable (FAIR) [14], most online datasets are far from FAIR. Our laboratory is developing technology that can rectify errors in online metadata. Like a spell-checker for metadata, our approach will attempt to identify the intentions of metadata authors, to correct typos, and to convert free-text strings to ontology terms whenever possible [6]. Our goal is to provide a service that will transform the scruffy metadata that pervade online descriptions of biomedical experiments into a form that will allow automated discovery, integration, and secondary analysis of research results in ways that are simply not possible at present. We anticipate that the Translator will call on our service to find experimental datasets and their accompanying metadata, to perform standard analyses of such datasets, and to integrate descriptions of experiments into the evolving knowledge graph. We will evaluate the performance of our Knowledge Provider by...
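
To make the "spell-checker for metadata" idea concrete, the following is a minimal sketch in Python, not the project's actual method. It assumes a tiny hand-written vocabulary (VOCAB), a hypothetical set of accepted Boolean spellings, and plain string similarity from the standard library's difflib; the function names normalize_boolean and suggest_term are invented for illustration. A real service would presumably draw on full ontologies and richer matching.

```python
import difflib

# Illustrative vocabulary only (an assumption, not taken from the project):
# maps a lowercase label to an ontology term identifier.
VOCAB = {
    "homo sapiens": "NCBITaxon:9606",
    "mus musculus": "NCBITaxon:10090",
    "liver": "UBERON:0002107",
    "blood": "UBERON:0000178",
}

# Hypothetical spellings that we choose to accept as Boolean values.
TRUE_STRINGS = {"true", "t", "yes", "y", "1"}
FALSE_STRINGS = {"false", "f", "no", "n", "0"}


def normalize_boolean(raw):
    """Return True or False, or None when the author's intent is unclear."""
    token = raw.strip().lower()
    if token in TRUE_STRINGS:
        return True
    if token in FALSE_STRINGS:
        return False
    return None


def suggest_term(raw, cutoff=0.8):
    """Suggest the closest vocabulary label and its term ID for a free-text value.

    Returns a (label, term_id) tuple, or None if nothing is similar enough.
    """
    # Lowercase and collapse runs of whitespace before comparing.
    token = " ".join(raw.strip().lower().split())
    matches = difflib.get_close_matches(token, list(VOCAB), n=1, cutoff=cutoff)
    if not matches:
        return None
    label = matches[0]
    return label, VOCAB[label]


if __name__ == "__main__":
    print(normalize_boolean(" Yes "))          # recognised Boolean spelling
    print(normalize_boolean("not collected"))  # intent unclear
    print(suggest_term("Homo sapeins"))        # typo for a known label
    print(suggest_term("unknown tissue xyz"))  # no close match expected
```

Returning None rather than guessing keeps unclear values visible for human review, which seems in keeping with a tool that tries to infer the author's intention instead of silently overwriting it.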

Key facts

NIH application ID: 10057243
Project number: 1OT2TR003453-01
Recipient: STANFORD UNIVERSITY
Principal Investigator: Mark A Musen
Activity code: OT2
Funding institute: NIH
Fiscal year: 2020
Award amount: $55,984
Award type: 1
Project period: 2020-01-23 → 2020-04-07