# DOCKET: accelerating knowledge extraction from biomedical data sets

> **NIH OT2** · INSTITUTE FOR SYSTEMS BIOLOGY · 2022 · $609,144

## Abstract

**Component type:** This Knowledge Provider project will continue and significantly extend work
done by the Translator Consortium Blue Team, focusing on deriving knowledge from real-world
data through complex analytic workflows, integrated into the Translator Knowledge Graph, and
served via tools like Big GIM and the Translator Standard API.

**The problem:** We aim to solve the “first mile” problem of translational research: how to
integrate the multitude of dynamic small-to-large data sets that have been produced by the
research and clinical communities, but that are in different locations, processed in different
ways, and in a variety of formats that may not be mutually interoperable. Integrating these data
sets requires significant manual work downloading, reformatting, parsing, indexing and
analyzing each data set in turn. The technical and ethical challenges of accessing diverse
collections of big data, efficiently selecting information relevant to different users’ interests, and
extracting the underlying knowledge are problems that remain unsolved. Here, we propose to
leverage lessons distilled from our previous and ongoing big data analysis projects to develop a
highly automated tool for removing these bottlenecks, enabling researchers to analyze and
integrate many valuable data sets with ease and efficiency, and making the data FAIR [1].

**Plan:** (AIM 1) We will analyze and extract knowledge from rich real-world biomedical data sets
(listed in the Resources page) in the domains of wellness, cancer, and large-scale clinical
records. (AIM 2) We will formalize methods from Aim 1 to develop DOCKET, a novel tool for
onboarding and integrating data from multiple domains. (AIM 3) We will work with other teams
to adapt DOCKET to additional knowledge domains.

- The DOCKET tool will offer 3 modules:
  1. DOCKET Overview: Analysis of, and knowledge extraction from, an individual data set.
  2. DOCKET Compare: Comparing versions of the same data set to compute confidence values,
     and comparing different data sets to find commonalities.
  3. DOCKET Integrate: Deriving knowledge through integrating different data sets.
- Researchers will be able to parameterize these functions, resolve inconsistencies, and derive
  knowledge through the command line, Jupyter notebooks, or other interfaces as specified by
  Translator Standards.
- The outcome will be a collection of nodes and edges, richly annotated with context, provenance
  and confidence levels, ready for incorporation into the Translator Knowledge Graph (TKG).
- All analyses and derived knowledge will be stored in standardized formats, enabling querying
  through the Reasoner Std API and ingestion into downstream AI-assisted machine learning.
- Example questions this will allow us to address include:
  - (Wellness) Which clinical analytes, metabolites, proteins, microbiome taxa, etc. are
    significantly correlated, and which changing analytes predict transition to which disease? [2,3]
  - (Cancer) Which gene mutations in any of X pat...
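
To make the intended output concrete, the following is a minimal sketch of how pairwise correlations in a single data set might be turned into edges carrying provenance and a confidence value. It is not DOCKET's implementation: the `Edge` record, the `correlated_with` predicate, the `correlation_edges` helper, the 0.8 threshold, the toy column names and values, and the use of the absolute Pearson coefficient as a stand-in for confidence are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from math import sqrt


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record; field names are illustrative, not a Translator schema."""
    subject: str
    predicate: str
    object: str
    confidence: float  # assumption: |Pearson r| used as a simple confidence proxy
    provenance: str    # identifier of the data set the edge was derived from


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def correlation_edges(columns, source, threshold=0.8):
    """Emit one edge per column pair whose |r| meets the (assumed) threshold."""
    edges = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            edges.append(Edge(a, "correlated_with", b, abs(r), source))
    return edges


# Toy data, invented for this sketch only.
columns = {
    "glucose": [1.0, 2.0, 3.0, 4.0],
    "hba1c": [2.0, 4.0, 6.0, 8.0],
    "noise": [1.0, -1.0, 1.0, -1.0],
}
edges = correlation_edges(columns, source="toy_dataset_v1")
for edge in edges:
    print(edge)
```

A real workflow would presumably also need significance testing, multiple-comparison correction, and mapping of column names to standard identifiers before any edge could be considered for the TKG; none of that is shown here.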

## Key facts

- **NIH application ID:** 10548024
- **Project number:** 3OT2TR003443-01S2
- **Recipient organization:** INSTITUTE FOR SYSTEMS BIOLOGY
- **Principal Investigator:** Gwênlyn Glusman
- **Activity code:** OT2
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $609,144
- **Award type:** 3
- **Project period:** 2020-01-24 → 2022-11-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10548024

## Citation

> US National Institutes of Health, RePORTER application 10548024, DOCKET: accelerating knowledge extraction from biomedical data sets (3OT2TR003443-01S2). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10548024. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
