# A modular data analysis ecosystem using portable encapsulated projects

> **NIH R35** · UNIVERSITY OF VIRGINIA · 2022 · $393,767

## Abstract

### Project summary

#### Overview

As the amount of available data increases, processing it becomes more challenging. Data processing is simple on the surface: it is a mapping from data to analysis. Unfortunately, too often, this mapping requires a unique structure for each combination of dataset and analysis. This makes it difficult to do things like run several different analyses on one dataset, or plug several different datasets into one analysis, because each connection structure must be defined manually.

To alleviate this challenge of linking data to tools, this proposal develops the concept of Portable Encapsulated Projects (PEP) and a series of tools that read and process such projects. Essentially, the PEP format aims to standardize the description of data collections, enabling both data providers and data users to communicate through the common interface of a standard format. Practically, this means individuals who describe their projects using this format will immediately inherit both greater portability for analysis and greater access to external complementary data. This link operates around a simple, standard, extensible definition of a project.

Alongside the format, this proposal develops Python and R packages to provide a modular framework with a low barrier to entry that makes it easy to build robust pipelines and other tools centered around the PEP format. This system presents a new approach to organizing data-intensive biomedical research projects.
#### Significance and innovation

This proposal sits at the interface of data management and bioinformatics tool development. While significant effort is already dedicated to each of these individually, there has been less focus on connecting the two. This proposal will build a standardized interface between data and tools in bioinformatics, providing practical advances in formats and tools to facilitate this interaction. This effort approaches computational projects in a novel way, and builds both concepts and tools that can revolutionize bioinformatics research. The goal is not to develop new tools, but to make existing tools more easily applied to existing data.

In computational research, a huge amount of effort is spent on data cleanup: preparing data for analysis. By facilitating the connection from data to tools, this work will encourage re-analysis of existing data with novel analysis techniques, leading to new discovery. It will also make it easier to analyze new data in tandem with existing data, increasing the value of both. It will contribute to reusability, larger-scale analysis, portable computing environments, and data sharing.

There is increasing interest in data sharing and accessibility across scientific domains, and this proposal will facilitate both. Early versions have already been adopted for both local compute and cluster computing at four different research institutions, and as the project matures, it will unite various research environments around a common data description. This will ...

## Key facts

- **NIH application ID:** 10468680
- **Project number:** 5R35GM128636-05
- **Recipient organization:** UNIVERSITY OF VIRGINIA
- **Principal Investigator:** Nathan Sheffield
- **Activity code:** R35
- **Funding institute:** NIH
- **Fiscal year:** 2022
- **Award amount:** $393,767
- **Award type:** 5
- **Project period:** 2018-08-01 → 2023-07-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10468680

## Citation

> US National Institutes of Health, RePORTER application 10468680, A modular data analysis ecosystem using portable encapsulated projects (5R35GM128636-05). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10468680. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
