A Training Module for Reproducible Data Science Research

NIH RePORTER · NIH · R25 · $94,168

Abstract

Scientific progress depends on the ability of scientists to communicate the details of their investigations, allowing others to learn new techniques and procedures and to critically review the process leading to any significant findings. However, this foundational aspect of the scientific process faces significant challenges. Rapid advances in computing technology have led to high-throughput data collection coupled with the application of complex statistical algorithms for data analysis. As a result, it has become nearly impossible to describe the scientific process precisely using traditional methods of communication. Compounding the problem of communicating data analytic complexity is the inability of traditional educational programs to keep up with technological and methodological changes. The shortage of data analytic skills and the corresponding lack of transparency regarding the scientific process are at the very core of the reproducibility and replication crisis in science today. To address the problem of scientific irreproducibility, training is needed in the fundamental aspects of good data analysis and reproducible research. Such training needs to go beyond traditional approaches that focus on developing a toolbox of statistical methods. While knowledge of tools and their properties is necessary for good data analysis, it is far from sufficient. Additional knowledge is required to combine those tools to produce a sound data analysis in a transparent manner. Furthermore, we must go beyond traditional methods of classroom learning in order to reach the entire scientific workforce. We will build training modules for improving data science research by leveraging recent work done by members of the Johns Hopkins Data Science Lab.
We will focus on two primary tracks: (1) strategies for reproducible data science, which include the higher-level principles for designing good data analyses, recognizing poor data analysis, and providing a proper critique of a data analysis; and (2) technologies and workflows, which cover the software tools for doing data analysis in a reproducible, distributable, and reusable manner. The materials developed in this project will supplement traditional training programs in biomedical data science fields and will be made entirely open source for others to use and adapt.

Key facts

NIH application ID
10199242
Project number
1R25GM141505-01
Recipient
JOHNS HOPKINS UNIVERSITY
Principal Investigator
ROGER PENG
Activity code
R25
Funding institute
NIH
Fiscal year
2021
Award amount
$94,168
Award type
1
Project period
2021-06-01 → 2024-05-31