# Data Discovery: Computational Methods for Searching Short-Read Sequencing Experiments - Administrative Supplement

> **NIH R01** · CARNEGIE-MELLON UNIVERSITY · 2021 · $8,236

## Abstract

This proposal aims to solve the sequencing experiment discovery problem. The data from hundreds of thousands of short-read sequencing experiments are now publicly available, and private collections of sequencing experiments are also growing rapidly. These experiments include hundreds of thousands of whole genome sequencing experiments, and tens of thousands of RNA-seq, metagenomic, and tumor sequencing samples. However, these experiments are vastly underused, with few analyses making use of more than a handful of experiments at a time and most analyses ignoring this collection of raw data entirely. One crucial reason for this is that merely finding the appropriate experiments is a significant barrier to their use in downstream analyses. This is due to the lack of a computational platform that can search for relevant short-read sequencing data sets by the sequences they contain. It is not currently possible to find all the metagenomic experiments in which the genes that form a particular pathway are present or to find all experiments in which a novel lncRNA is observed. The experiment discovery problem is that of finding, on a global scale, those experiments that are relevant to an isoform, variant, or species under study. By building on our existing work in large-scale sequence search, we propose to develop a new distributed platform to index and search hundreds of thousands of raw short-read sequencing data sets to enable researchers to quickly find experiments that contain their query sequences. We will apply this system to searching RNA-seq, metagenomic, and cancer tumor samples. The research questions we will solve include how to improve the computational scaling, increase the types of biologically meaningful queries that can be answered, and increase our ability to find relevant experiments in situations where mutations are common. We will produce a high-quality open-source implementation of the developed computational methods. The project will significantly expand the usefulness of large repositories of raw sequencing reads and enable new approaches for large-scale reanalysis and reuse of short-read experiments. The system will unlock a rich source of biological information for gene function prediction, for understanding microbial communities, and for connecting genetic variation with disease progression.

## Key facts

- **NIH application ID:** 10393953
- **Project number:** 3R01GM122935-04S1
- **Recipient organization:** CARNEGIE-MELLON UNIVERSITY
- **Principal Investigator:** Carleton Lee Kingsford
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $8,236
- **Award type:** 3
- **Project period:** 2017-05-01 → 2022-04-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10393953

## Citation

> US National Institutes of Health, RePORTER application 10393953, Data Discovery: Computational Methods for Searching Short-Read Sequencing Experiments - Administrative Supplement (3R01GM122935-04S1). Retrieved via AI Analytics 2026-05-25 from https://api.ai-analytics.org/grant/nih/10393953. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*