# Enriching SARS-CoV-2 sequence data in public repositories with information extracted from full text articles

> **NIH R01** · ARIZONA STATE UNIVERSITY-TEMPE CAMPUS · 2022 · $595,707

## Abstract

Project Summary
In response to the COVID-19 pandemic, scientists have published over one hundred thousand research articles
and made available over eight hundred thousand virus genome sequences. These sequences, along with their
metadata, can be used to understand virus evolution and spread and their implications for public health, a field of
study called genomic epidemiology. However, these sequence records do not typically contain patient metadata
such as demographics, clinical severity, or comorbidities, preventing researchers from uncovering trends in
population health. To understand the severity of the problem, we analyzed nearly 748 thousand SARS-CoV-2
records from GISAID and 60 thousand from GenBank for the presence of patient metadata. Age and
gender were represented in < 1% of GenBank records; in GISAID, 26% of records included sex and 24% included age. For
other fields, the amount of missing data is even more pronounced, with neither resource providing information on
a patient's race and only GISAID specifying severity (i.e., ICU) in less than 5% of records. To address missing
virus metadata, researchers could use the publication associated with the new sequences; however, the virus
sequence record is often not updated with a link to the publication. From the set of records that we analyzed,
3.4% (of 748K) in GISAID and < 1% (of 117K) in GenBank had a link to a publication. This greatly hinders
secondary data analysis of these sequences and limits the ability to use them at scale to uncover associations
between the viral genome, transmission risk, and health outcomes. The goal of this proposal is to enhance
genomic epidemiology and population health of COVID-19 with a framework to continuously and automatically
enrich SARS-CoV-2 nucleic acid sequence metadata in public databases such as GenBank and GISAID with
metadata in associated published articles. We will incorporate input from clinicians at the front line of patient
care during the pandemic and build on our NIH-funded work (R01AI117011), which used Natural Language
Processing (NLP) to enrich the geographic metadata of a sequence record using its corresponding published
article. We have used these data in virus phylogeographic models and shown the benefit of using enriched
metadata for modeling virus evolution and spread. The availability of SARS-CoV-2 sequences, paired with
full-text COVID-19 articles and preprints, presents an opportunity for metadata enrichment and scientific discovery
beyond our prior work. Our specific aims are to: (1) enrich SARS-CoV-2 sequence metadata using text extracted
from publications and (2) derive key epidemiologic insights for different patient demographics using our enriched
SARS-CoV-2 sequence dataset. We will leverage our prior joint work funded by the NIH to enable the secondary
use of enriched metadata for genomic epidemiology to improve our understanding of SARS-CoV-2 evolution and
spread among different population groups. We will disseminate t...

## Key facts

- **NIH application ID:** 10681068
- **Project number:** 7R01AI164481-02
- **Recipient organization:** ARIZONA STATE UNIVERSITY-TEMPE CAMPUS
- **Principal Investigator:** GRACIELA GONZALEZ HERNANDEZ
- **Activity code:** R01 (R01, R21, SBIR, etc.)
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $595,707
- **Award type:** 7
- **Project period:** 2022-09-01 → 2024-08-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10681068

## Citation

> US National Institutes of Health, RePORTER application 10681068, Enriching SARS-CoV-2 sequence data in public repositories with information extracted from full text articles (7R01AI164481-02). Retrieved via AI Analytics 2026-05-26 from https://api.ai-analytics.org/grant/nih/10681068. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*