Tuning big data analysis infrastructure for HIV research

NIH RePORTER · NIH · R01 · $374,737

Abstract

Summary: The COVID-19/SARS-CoV-2 pandemic is a once-in-a-generation, “all-hands-on-deck” event for the scientific community. This pandemic is also the first in which real-time genomic data are available, e.g., via GISAID [1], where genomic sequences are deposited daily. Vital insights about the virus and the epidemic depend on rapid and reliable genomic analysis of diverse viral sample sequences by multiple laboratories. Yet we repeatedly encounter the same avoidable shortcomings early in viral investigations, including COVID-19: lack of reproducibility, rigor, and sharing of data and analytics. Only about 10% of the published genomes have quality metrics, primary data (read files), or any level of detail on the analytics, leaving the remainder irreproducible and unverifiable; over 40% of GISAID submissions to date provide no information about how the sequences were generated. Essential questions about the extent of intra-host genomic variability (indicative of adaptation or multiple infection), viral evolution (selection, recombination), and transmission (phylogenetics and phylogeography) cannot be answered reliably if researchers cannot trust and replicate the source data and analytical approaches. One of the key deliverables of this supplement will be open analytic workflows that can be used to curate and standardize genomic data, together with high-quality annotated variation data.

Key facts

NIH application ID: 10148893
Project number: 3R01AI134384-04S1
Recipient: PENNSYLVANIA STATE UNIVERSITY, THE
Principal Investigator: ANTON NEKRUTENKO
Activity code: R01
Funding institute: NIH
Fiscal year: 2020
Award amount: $374,737
Award type: 3
Project period: 2020-07-09 → 2022-05-31