# Containerizing tasks to ensure robust AI/ML data curation pipelines to estimate environmental disparities in the rural south

> **NIH R01** · YALE UNIVERSITY · 2023 · $343,780

## Abstract

Our parent R01 addresses a major scientific gap by studying how the health of rural populations, including rural
racial and ethnic minority groups, is impacted by air pollution and extreme temperature. As part of the parent
R01, we are 1) generating a data architecture for air pollution, heat, cold, health, socioeconomic status (SES),
urban and rural form, and other factors in Virginia and West Virginia; 2) estimating the disparities in exposure
to air pollution and weather (cold, heat, and heat waves) by race/ethnicity, SES, and rurality, accounting for
variations in rural form; and 3) estimating the disparities in the associations between exposure to air pollution
or weather and health for the very young (adverse birth outcomes) and older populations (hospital admissions
for those >65 years), considering differences by various vulnerability factors (low SES, race/ethnicity) and
urban/rural form.

As part of the parent R01, we have made significant progress in generating a data architecture for air pollution,
heat, cold, health, SES, urban/rural form, and other factors. More specifically, we have obtained health data
from the Centers for Medicare & Medicaid Services (CMS). However, while considerable progress has been
made in developing data processing pipelines for CMS claims data, the CMS data privacy and confidentiality
limitations hinder the sharing of preprocessed datasets, leading to duplication of data cleaning and processing
efforts. This duplication of effort can be wasteful, as researchers may not be able to build on each other's work
or collaborate effectively. While the data themselves cannot be shared, sharing open processing pipelines is
crucial to eliminate duplicated effort and make AI/ML-ready data more readily available. When workflows are
shared, containers are critical to ensuring reproducibility.

With this administrative supplement, our goal is the adoption of containerized data processing tasks to
enhance the deployment of AI/ML pipelines for CMS data in the parent R01 and the wider research
communities. The use of containers enables the easy exchange or updating of single components in a
processing pipeline, which can be reused/recycled across AI/ML pipelines shared by different investigators in
the study team and more broadly across research institutions. Adopting containerized data processing
tasks enhances the reproducibility of our parent R01 and allows for the optimization of computational resources
on High-Performance Computing (HPC) systems.
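
The idea of exchanging a single component in a containerized pipeline can be illustrated with a minimal Python sketch. It is not the project's implementation: the task names, image tags, scripts, and the `/secure/data` path are all hypothetical, and the use of `docker run` is an assumption (HPC sites often use a different container runtime). The sketch only builds the command lines and does not execute them.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class ContainerTask:
    """One pipeline step pinned to a container image (all names are hypothetical)."""
    name: str
    image: str                # illustrative tag, e.g. "example/clean-claims:1.0"
    command: Tuple[str, ...]  # command run inside the container


def build_run_command(task: ContainerTask, data_dir: str) -> List[str]:
    """Return the argv for one task, mounting the data directory at /data."""
    return ["docker", "run", "--rm", "-v", f"{data_dir}:/data",
            task.image, *task.command]


def swap_task(pipeline: List[ContainerTask], name: str,
              new_image: str) -> List[ContainerTask]:
    """Return a new pipeline in which only the named task uses a different image."""
    return [replace(t, image=new_image) if t.name == name else t
            for t in pipeline]


# A two-step pipeline; both steps are placeholders for real processing tasks.
pipeline = [
    ContainerTask("clean", "example/clean-claims:1.0", ("python", "clean.py")),
    ContainerTask("link", "example/link-exposure:1.0", ("python", "link.py")),
]

# Update one component; the other step and the original pipeline are untouched.
updated = swap_task(pipeline, "clean", "example/clean-claims:1.1")
for task in updated:
    print(" ".join(build_run_command(task, "/secure/data")))
```

Because each step is pinned to an image tag, sharing the pipeline definition shares the exact processing environment without sharing any data.
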

The container-based AI/ML pipelines will accelerate research in the parent R01, allowing us to
rigorously estimate the disparities in the associations between exposure to air pollution or weather and health
outcomes. Furthermore, these improvements are crucial to allow for the dissemination of the workflow
pipelines across the wider research community.

## Key facts

- **NIH application ID:** 10842665
- **Project number:** 3R01MD016054-02S2
- **Recipient organization:** YALE UNIVERSITY
- **Principal Investigator:** Michelle L Bell
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2023
- **Award amount:** $343,780
- **Award type:** 3
- **Project period:** 2022-07-24 → 2024-03-31

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10842665

## Citation

> US National Institutes of Health, RePORTER application 10842665, Containerizing tasks to ensure robust AI/ML data curation pipelines to estimate environmental disparities in the rural south (3R01MD016054-02S2). Retrieved via AI Analytics 2026-06-12 from https://api.ai-analytics.org/grant/nih/10842665. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
