# A Training Module for Reproducible Data Science Research

> **NIH R25** · JOHNS HOPKINS UNIVERSITY · 2021 · $94,168

## Abstract

Scientific progress depends on the ability of scientists to communicate the details of their
investigations, allowing others to learn new techniques and procedures and to critically review
the process leading to any significant findings. However, this foundational aspect of the
scientific process faces significant challenges. Rapid advances in computing technology have led
to high-throughput data collection coupled with the application of complex statistical
algorithms for data analysis. As a result, it has become nearly impossible to describe the
scientific process precisely using traditional methods of communication. Compounding the
problem of communicating data analytic complexity is the inability of traditional educational
programs to keep up with technological and methodological changes. The shortage of data
analytic skills and the corresponding lack of transparency regarding the scientific process are at
the very core of the reproducibility and replication crisis in science today. In order to address the
problem of scientific irreproducibility, training is needed in the fundamental aspects of good
data analysis and reproducible research. Such training needs to go beyond traditional
approaches which focus on developing a toolbox of statistical methods. While knowledge of
tools and their properties is necessary for good data analysis, it is far from sufficient. Additional
knowledge is required to combine those tools to produce a sound data analysis in a transparent
manner. Furthermore, we must go beyond traditional methods of classroom learning in order to
reach the entire scientific workforce. We will build training modules for improving data science
research by leveraging recent work done by members of the Johns Hopkins Data Science Lab.
We will focus on two primary tracks: (1) strategies for reproducible data science, which include
the higher-level principles for designing good data analyses, recognizing poor data analysis, and
providing a proper critique of a data analysis; and (2) technologies and workflows, which cover
the software tools for doing data analysis in a reproducible, distributable, and reusable manner.
The materials developed in this project will supplement traditional training programs in
biomedical data science fields and will be made entirely open source for others to use and
adapt.

## Key facts

- **NIH application ID:** 10199242
- **Project number:** 1R25GM141505-01
- **Recipient organization:** JOHNS HOPKINS UNIVERSITY
- **Principal Investigator:** ROGER PENG
- **Activity code:** R25
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $94,168
- **Award type:** 1
- **Project period:** 2021-06-01 → 2024-05-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10199242

## Citation

> US National Institutes of Health, RePORTER application 10199242, A Training Module for Reproducible Data Science Research (1R25GM141505-01). Retrieved via AI Analytics 2026-05-24 from https://api.ai-analytics.org/grant/nih/10199242. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*