Heuristics to evaluate biomedical and genomic knowledge bases for validity

NIH RePORTER · NIH · R01 · $480,000

Abstract

Project Summary

Our overarching goal is to understand how information characterizing genes and their function can be organized, integrated, and then generalized to new contexts. This is a central question of the post-genomic era, and one that becomes ever more pressing as novel assays expand the scope, breadth, and detail of information describing gene properties. While the Gene Ontology is the most prominent and universal system for organizing gene function, hundreds of others exist, often serving specialized research interests. Most laboratories depend on the validity of some subset of these data to design new experiments or interpret their results, but the quality of the data is hard to ascertain directly, particularly in novel or complex integrative methodologies.

Based on substantial preliminary data, we hypothesize that determining robustness and specificity will provide a highly general assessment of the utility of databases. We propose to use these properties to assess the entire corpus of resources organizing gene information, as well as the methods which exploit this information, and the results that they report. Critically, determining robustness and specificity does not require validation with respect to ‘gold standard’ information. By evaluating these resources with respect to their joint specificity and robustness, we will determine means of integrating and organizing their data for use in novel applications. Finally, we propose to apply our improvements in quality control to better target rare but robust results where this is an experimental goal, notably in rare diseases and single cell expression.

The three complementary objectives in this project are to:

1. Determine the uniqueness and robustness of data characterizing gene function. We will develop a formal approach for characterizing robustness and uniqueness/specificity by exploiting prior probability in the form of gene multifunctionality. We will evaluate robustness and specificity across essentially all complex and structured databases characterizing genes. These measures can be compared between databases or over time and provide a global landscape of data structure.

2. Test methods designed to exploit information describing gene function. Statistical and machine learning methods exploiting structured data will be assessed for robust and specific output. Data features driving performance in diverse applications will be identified, and complementary sources of data as well as community clusters will be defined.

3. Evaluate results that depended on the use of databases describing gene function. Using a combination of text-mining and figure-mining, we will assess the ongoing literature for novel, robust, and specific gene-function associations. We will characterize and evaluate the “dark matter” of gene-function association from the standpoint of both unannotated genes and incompletely annotated functions.
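
As a rough illustration of the idea in objective 1, the sketch below treats gene multifunctionality as a prior and asks how well that prior alone recovers the members of a given annotation term. It is only one possible reading of the abstract, not the project's actual method: the scoring rule (a plain count of terms per gene), the function names, and the toy gene and term labels are all assumptions made for illustration.

```python
from collections import Counter


def multifunctionality(annotations):
    """Score each gene by the number of terms that annotate it.

    `annotations` maps a term name to the set of genes it annotates.
    A plain count is an assumed simplification; weighted scores are possible.
    """
    counts = Counter()
    for genes in annotations.values():
        counts.update(genes)
    return dict(counts)


def auroc(scores, positives, universe):
    """Probability that a random positive gene outscores a random negative one.

    Ties count as half a win. Genes missing from `scores` get a score of 0.
    """
    pos = [scores.get(g, 0) for g in universe if g in positives]
    neg = [scores.get(g, 0) for g in universe if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy database: four terms over six genes (g6 is unannotated).
toy_annotations = {
    "term_A": {"g1", "g2", "g3"},
    "term_B": {"g1", "g2"},
    "term_C": {"g1", "g4"},
    "term_D": {"g5"},
}
toy_universe = ["g1", "g2", "g3", "g4", "g5", "g6"]

toy_scores = multifunctionality(toy_annotations)
# For each term, how well does the multifunctionality ranking alone predict membership?
toy_auroc = {
    term: auroc(toy_scores, genes, toy_universe)
    for term, genes in toy_annotations.items()
}
```

Under this reading, a term whose members are recovered almost perfectly by the multifunctionality ranking carries little information beyond "these genes are heavily annotated", so it would count as less specific. Note that the count here includes the term being evaluated, which biases the value slightly upward; a more careful version would exclude it.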

Key facts

NIH application ID: 9995573
Project number: 5R01LM012736-04
Recipient: COLD SPRING HARBOR LABORATORY
Principal Investigator: Jesse Gillis
Activity code: R01
Funding institute: NIH
Fiscal year: 2020
Award amount: $480,000
Award type: 5
Project period: 2017-09-15 → 2022-08-31