# Heuristics to evaluate biomedical and genomic knowledge bases for validity

> **NIH R01** · COLD SPRING HARBOR LABORATORY · 2020 · $480,000

## Abstract

Our overarching goal is to understand how information characterizing genes and their function can be
organized, integrated, and then generalized to new contexts. This is a central question of the post-genomic era,
and one that becomes ever more pressing as novel assays expand the scope, breadth, and detail of information
describing gene properties. While the Gene Ontology is the most prominent and universal system for
organizing gene function, hundreds of others exist, often serving specialized research interests. Most
laboratories depend on the validity of some subset of this data to design new experiments or interpret their
results, but its quality is hard to ascertain directly, particularly in novel or complex integrative
methodologies. Based on substantial preliminary data, we hypothesize that determining robustness and
specificity will provide a highly general assessment of the utility of databases. We propose to use these
properties to assess the entire corpus of resources organizing gene information, as well as the methods which
exploit this information, and the results that they report. Critically, determining robustness and specificity does
not require validation with respect to ‘gold standard’ information. By evaluating these resources with respect to
their joint specificity and robustness, we will determine means of integrating and organizing their data for use in
novel applications. Finally, we propose to apply our improvements in quality control to better target rare but
robust results where this is an experimental goal, notably rare diseases and single cell expression.
The three complementary objectives in this project are to:
1. Determine the uniqueness and robustness of data characterizing gene function. We will develop a
formal approach for characterizing robustness and uniqueness/specificity by exploiting prior probability in the
form of gene multifunctionality. We will evaluate robustness and specificity across essentially all complex and
structured databases characterizing genes. These measures can be compared between databases or over time
and provide a global landscape of data structure.
2. Test methods designed to exploit information describing gene function. Statistical and machine
learning methods exploiting structured data will be assessed for robust and specific output. Data features
driving performance in diverse applications will be identified and complementary sources of data as well as
community clusters will be defined.
3. Evaluate results that depended on the use of databases describing gene function. Using a
combination of text-mining and figure-mining, we will assess the ongoing literature for novel, robust, and
specific gene-function associations. We will characterize and evaluate the “dark matter” of gene-function
association from the standpoint of both unannotated genes and incomplete functions.

## Key facts

- **NIH application ID:** 9995573
- **Project number:** 5R01LM012736-04
- **Recipient organization:** COLD SPRING HARBOR LABORATORY
- **Principal Investigator:** Jesse Gillis
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2020
- **Award amount:** $480,000
- **Award type:** 5 (non-competing continuation)
- **Project period:** 2017-09-15 → 2022-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/9995573

## Citation

> US National Institutes of Health, RePORTER application 9995573, Heuristics to evaluate biomedical and genomic knowledge bases for validity (5R01LM012736-04). Retrieved via AI Analytics 2026-05-27 from https://api.ai-analytics.org/grant/nih/9995573. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
