Data analysis tools for leveraging massive public data to improve hypothesis-driven research

NIH RePORTER · NIH · R35 · $26,837

Abstract

Project summary: There is a crisis of reproducibility and replicability of scientific results. This crisis is an increasing source of concern in both the scientific and popular press. The crisis is so acute that the United States Congress is currently investigating the reproducibility of the scientific process. At the heart of this crisis is a collection of problems, including small sample sizes, underpowered studies, undertrained data analysts, and an inability to directly leverage prior results in the statistical analysis of smaller, hypothesis-driven experiments that use high-throughput technologies. Advances in technology have dramatically reduced the cost and difficulty of collecting high-throughput molecular data. Large collections of raw data are increasingly publicly available but are usually incorporated into individual analyses by NIGMS and other investigators on an ad hoc basis. Meanwhile, the other costs of running a designed, hypothesis-driven study have not decreased at the same pace as the technology has advanced. It is still expensive to identify, recruit, collect, and follow up samples even if the high-throughput measurements themselves are cheap. Despite the large amount of available public data, it is still common practice to perform statistical inference in these hypothesis-driven experiments study by study, only indirectly including previous data, estimates, and results. As a result, findings from these studies may be highly variable, unreliable, or unreplicable. Our group has focused on developing statistical methods, data resources, software, and training that allow researchers to borrow strength empirically from public repositories, large-scale data generation projects, and crowd-sourced data to improve inference in individual, hypothesis-driven studies. We propose to build on our work in developing statistical data sources, methods, software, and training that facilitate and speed the work of our biological and medical collaborators.
The result will be a research community that can take advantage of public data already collected at a large cost to the NIH to improve power, reduce required sample sizes, and improve replication in many new hypothesis-driven molecular studies of development and disorder.

Key facts

NIH application ID: 10330636
Project number: 1R35GM144128-01
Recipient: JOHNS HOPKINS UNIVERSITY
Principal Investigator: Jeffrey T. Leek
Activity code: R35
Funding institute: NIH
Fiscal year: 2022
Award amount: $26,837
Award type: 1
Project period: 2022-04-01 → 2022-05-14