# Data analysis tools for leveraging massive public data to improve hypothesis-driven research

> **NIH R35** · JOHNS HOPKINS UNIVERSITY · 2022 · $26,837

## Abstract

There is a crisis of reproducibility and replicability of scientific results. This crisis is an increasing source of
concern in both the scientific and popular press. The crisis is so acute that the United States Congress is currently
investigating reproducibility of the scientific process. At the heart of this crisis is a collection of problems including
small sample sizes, under-powered studies, under-trained data analysts, and an inability to directly leverage prior
results in the statistical analysis of smaller, hypothesis-driven experiments using high-throughput technologies.
Advances in technology have dramatically reduced the cost and difficulty of collecting high-throughput molecular
data. Large collections of raw data are increasingly publicly available but are usually incorporated into individual
analyses by NIGMS and other investigators on an ad-hoc basis. Meanwhile, the other costs of running a designed,
hypothesis-driven study have not decreased at the same speed with technological advances. It is still expensive to
identify, recruit, collect, and follow up samples even if the high-throughput measurements themselves are cheap.
Despite the incredible amount of available public data, it is still common practice to perform statistical inference
in these hypothesis-driven experiments study-by-study, only indirectly including previous data, estimates, and
results. Consequently, findings from these studies may be highly variable, unreliable, or unreplicable. Our group has focused
on developing statistical methods, data resources, software, and training that allow researchers to borrow
strength empirically from public repositories, large-scale data generation projects, and crowd-sourced data to
improve inference in individual, hypothesis-driven studies. We propose to build on our work in developing
statistical data sources, methods, software, and training that facilitate and speed the work of our biological and
medical collaborators. The result will be a research community that can take advantage of public data already
collected at a large cost to the NIH to improve power, reduce required sample sizes, and improve replication in
many new hypothesis-driven molecular studies of development and disorder.

## Key facts

- **NIH application ID:** 10330636
- **Project number:** 1R35GM144128-01
- **Recipient organization:** JOHNS HOPKINS UNIVERSITY
- **Principal Investigator:** Jeffrey T. Leek
- **Activity code:** R35
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $26,837
- **Award type:** 1
- **Project period:** 2022-04-01 → 2022-05-14

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10330636

## Citation

> US National Institutes of Health, RePORTER application 10330636, Data analysis tools for leveraging massive public data to improve hypothesis-driven research (1R35GM144128-01). Retrieved via AI Analytics 2026-05-23 from https://api.ai-analytics.org/grant/nih/10330636. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*