# Constructing a large-scale biomedical knowledge graph using all PubMed abstracts and PMC full-text articles and its applications

> **NIH R21** · INSILICOM, LLC · 2024 · $143,162

## Abstract

The number of biomedical publications is growing at an accelerating pace. This ever-increasing
volume of scientific literature has made it impossible to keep up with all published articles,
even in a very specific research area. The large volume of scientific publications has also
made it very challenging for modern search engines to find relevant articles accurately for a
given query. Missing important prior studies in a literature search can have serious consequences,
such as wasting time and resources or drawing incorrect scientific conclusions. Another unmet
challenge in literature search is that researchers often prefer articles in which their query
terms are part of the new discoveries rather than the background knowledge.
Current search engines cannot distinguish between new discoveries and background
knowledge in an article. A related challenge is that it can be difficult to identify the latest
discoveries in a particular scientific area without reading all the recently published articles. To
address these challenges, one can convert unstructured text data into structured form, which can
then support highly accurate information retrieval, information integration and automated
knowledge discovery. A plausible approach for converting unstructured text into structured form
is to use named entity recognition (NER) and relation extraction (RE) methods to identify the
biological entities and extract their relations to construct knowledge graphs (KGs). KGs can link
concepts within existing research to allow researchers to find connections that may have been
difficult to discover without them. The LitCoin Natural Language Processing (NLP) Challenge
was recently organized by NCATS of NIH and NASA to spur innovation by rewarding the most
creative and high-impact uses of free text from biomedical publications to create KGs. In addition to
entities and relations, the manually annotated dataset provided by LitCoin also marks
each relation as either a new discovery or background knowledge. Our team
participated in the challenge and ranked first. This application aims to apply the
methods we developed for LitCoin to all PubMed abstracts and PMC full-text articles to
build the largest-scale KG to date and develop applications on top of it. Specifically, we will (1)
develop a knowledge visualization and navigation tool combined with a deep learning-powered
search engine we developed previously; (2) develop advanced relation search functions to enable
knowledge discovery applications such as drug repurposing and adverse effect discovery; (3)
develop functions that allow users to search specifically for new discoveries reported in articles; and (4)
develop functions that return the latest discoveries in a scientific area for a given time period.
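
As a rough illustration of the structured form described above, the following minimal Python sketch stores extracted relations as triples, each flagged as a new discovery or background knowledge, and supports the two kinds of queries named in aims (3) and (4). It is not the project's implementation: the `Relation` and `KnowledgeGraph` classes, their field names, and all entities, identifiers, and dates are hypothetical placeholders, and the NER and RE steps that would populate the graph are assumed to have run already.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Relation:
    """One extracted relation (hypothetical schema, for illustration only)."""
    head: str          # first entity, e.g. a drug
    relation: str      # relation type, e.g. "treats"
    tail: str          # second entity, e.g. a disease
    source_id: str     # placeholder article identifier
    published: date    # publication date of the source article
    is_novel: bool     # True = new discovery, False = background knowledge


class KnowledgeGraph:
    """A toy in-memory store of relations."""

    def __init__(self):
        self._relations = []

    def add(self, rel):
        self._relations.append(rel)

    def search(self, entity, novel_only=False):
        """Return relations involving `entity`, optionally only new discoveries."""
        return [
            r for r in self._relations
            if entity in (r.head, r.tail) and (r.is_novel or not novel_only)
        ]

    def latest_discoveries(self, start, end):
        """Return new discoveries published in [start, end], newest first."""
        hits = [
            r for r in self._relations
            if r.is_novel and start <= r.published <= end
        ]
        return sorted(hits, key=lambda r: r.published, reverse=True)


# Placeholder records; none of these correspond to real articles or findings.
kg = KnowledgeGraph()
kg.add(Relation("drug_A", "treats", "disease_X", "ART-0001", date(2023, 3, 1), False))
kg.add(Relation("drug_A", "associated_with", "disease_Y", "ART-0002", date(2024, 1, 15), True))
kg.add(Relation("gene_B", "associated_with", "disease_X", "ART-0003", date(2024, 6, 30), True))

all_hits = kg.search("drug_A")                     # background and novel relations
novel_hits = kg.search("drug_A", novel_only=True)  # new discoveries only
recent = kg.latest_discoveries(date(2024, 1, 1), date(2024, 12, 31))
```

In this sketch the novelty flag is a plain boolean on each relation, so restricting a search to new discoveries is a filter rather than a separate index; a system at PubMed scale would presumably need a database-backed design instead.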

## Key facts

- **NIH application ID:** 10908293
- **Project number:** 5R21LM014277-02
- **Recipient organization:** INSILICOM, LLC
- **Principal Investigator:** Jinfeng Zhang
- **Activity code:** R21
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $143,162
- **Award type:** 5 (non-competing continuation)
- **Project period:** 2023-09-01 → 2025-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10908293

## Citation

> US National Institutes of Health, RePORTER application 10908293, Constructing a large-scale biomedical knowledge graph using all PubMed abstracts and PMC full-text articles and its applications (5R21LM014277-02). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10908293. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
