# Methods for sequencing data analysis and archive-scale data science

> **NIH R35** · JOHNS HOPKINS UNIVERSITY · 2022 · $514,058

## Abstract

PROJECT SUMMARY
We will develop methods and maintain software that make it radically easier for biomedical researchers to
use and understand sequencing data. The project will support maintaining and improving our popular
“upstream” tools for analyzing sequencing data. These include the Bowtie and Bowtie 2 tools for read
alignment, the Kraken 2 tool for metagenomics classification and the Dashing tool for genomic sketching
and comparison. We will also develop new systems that allow researchers to use these same core tools
(Bowtie, Kraken 2, Dashing) to rapidly discover and vet archived datasets. We will enable researchers to
quickly ascertain whether a dataset is of high quality, what species are present, whether contaminants
are present, what assay was performed, what datasets are similar to each other, and what datasets are
inconsistent with annotated metadata. In this way, researchers can distill relevant archived datasets, those
having the expected biological properties, in a way that does not hinge on the accuracy of the associated
metadata. Finally, we will work to develop new infrastructure for large-scale reanalysis and indexing of
archived data, ultimately yielding new “search engines” for scientific question-answering. In particular,
we will extend our past work on Rail-RNA, recount2 and Snaptron so that we can more effectively
analyze huge collections of archived data, converting them into a variety of useful summary forms, and
then adding a layer of indexing so that users can query the summaries in the context of a scientific
investigation. We will also create new catalogs and mechanisms whereby researchers can share their
archive-assisted study designs, so that useful combinations of archived datasets, and insights into where
their metadata might be incorrect or incomplete, can be reported and shared.

## Key facts

- **NIH application ID:** 10322369
- **Project number:** 5R35GM139602-02
- **Recipient organization:** JOHNS HOPKINS UNIVERSITY
- **Principal Investigator:** Benjamin Thomas Langmead
- **Activity code:** R35
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $514,058
- **Award type:** 5
- **Project period:** 2021-01-01 → 2025-12-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10322369

## Citation

> US National Institutes of Health, RePORTER application 10322369, Methods for sequencing data analysis and archive-scale data science (5R35GM139602-02). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10322369. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
