Data analysis tools for leveraging massive public data to improve hypothesis-driven research

NIH RePORTER · NIH · R35 · $26,837

Abstract

Project summary: There is a crisis of reproducibility and replicability of scientific results. This crisis is an increasing source of concern in both the scientific and popular press. The crisis is so acute that the United States Congress is currently investigating reproducibility of the scientific process. At the heart of this crisis is a collection of problems including small sample sizes, under-powered studies, under-trained data analysts, and an inability to directly leverage prior results in the statistical analysis of smaller, hypothesis-driven experiments using high-throughput technologies. Advances in technology have dramatically reduced the cost and difficulty of collecting high-throughput molecular data. Large collections of raw data are increasingly publicly available but are usually incorporated into individual analyses by NIGMS and other investigators on an ad hoc basis. Meanwhile, the other costs of running a designed, hypothesis-driven study have not decreased at the same speed with technological advances. It is still expensive to identify, recruit, collect, and follow up samples even if the high-throughput measurements themselves are cheap. Despite the incredible amount of available public data, it is still common practice to perform statistical inference in these hypothesis-driven experiments study by study, only indirectly including previous data, estimates, and results. As a result, findings from these studies may be highly variable, unreliable, or unreplicable. Our group has focused on developing statistical methods, data resources, software, and training that allow researchers to borrow strength empirically from public repositories, large-scale data generation projects, and crowd-sourced data to improve inference in individual, hypothesis-driven studies. We propose to build on our work in developing statistical data sources, methods, software, and training that facilitate and speed the work of our biological and medical collaborators.
The result will be a research community that can take advantage of public data already collected at a large cost to the NIH to improve power, reduce required sample sizes, and improve replication in many new hypothesis-driven molecular studies of development and disorder.

Key facts

NIH application ID: 10330636
Project number: 1R35GM144128-01
Recipient: JOHNS HOPKINS UNIVERSITY
Principal Investigator: Jeffrey T. Leek
Activity code: R35
Funding institute: NIH
Fiscal year: 2022
Award amount: $26,837
Award type: 1
Project period: 2022-04-01 → 2022-05-14