Methods for sequencing data analysis and archive-scale data science

NIH RePORTER · NIH · R35 · $514,058

Abstract

PROJECT SUMMARY We will develop methods and maintain software that make it radically easier for biomedical researchers to use and understand sequencing data. The project will support the maintenance and improvement of our popular “upstream” tools for analyzing sequencing data. These include the Bowtie and Bowtie 2 tools for read alignment, the Kraken 2 tool for metagenomics classification, and the Dashing tool for genomic sketching and comparison. We will also develop new systems that allow researchers to use these same core tools (Bowtie, Kraken 2, Dashing) to rapidly discover and vet archived datasets. We will enable researchers to quickly ascertain whether a dataset is of high quality, what species are present, whether contaminants are present, what assay was performed, which datasets are similar to each other, and which datasets are inconsistent with their annotated metadata. In this way, researchers can distill the relevant archived datasets, those having the expected biological properties, in a way that does not hinge on the accuracy of the associated metadata. Finally, we will work to develop new infrastructure for large-scale reanalysis and indexing of archived data, ultimately yielding new “search engines” for scientific question-answering. In particular, we will extend our past work on Rail-RNA, recount2 and Snaptron so that we can more effectively analyze huge collections of archived data, converting them into a variety of useful summary forms and then adding a layer of indexing so that users can query the summaries in the context of a scientific investigation. We will also create new catalogs and mechanisms whereby researchers can share their archive-assisted study designs, so that useful combinations of archived datasets, and insights into where their metadata might be incorrect or incomplete, can be reported and shared.

Key facts

NIH application ID: 10757405
Project number: 5R35GM139602-04
Recipient: JOHNS HOPKINS UNIVERSITY
Principal Investigator: Benjamin Thomas Langmead
Activity code: R35
Funding institute: NIH
Fiscal year: 2024
Award amount: $514,058
Award type: 5
Project period: 2021-01-01 → 2025-12-31