Containerizing tasks to ensure robust AI/ML data curation pipelines to estimate environmental disparities in the rural south

NIH RePORTER · NIH · R01 · $343,780

Abstract

Project Summary: Our parent R01 addresses a major scientific gap by studying how the health of rural populations, including rural racial and ethnic minority groups, is impacted by air pollution and extreme temperature. As part of the parent R01 we are 1) generating a data architecture for air pollution, heat, cold, health, socioeconomic status (SES), urban and rural form, and other factors in Virginia and West Virginia; 2) estimating the disparities in exposure to air pollution and weather (cold, heat, and heat waves) by race/ethnicity, SES, and rurality, accounting for variations in rural form; and 3) estimating the disparities in the associations between exposure to air pollution or weather and health for the very young (adverse birth outcomes) and older populations (hospital admissions for those >65 years), considering differences by various vulnerability factors (low SES, race/ethnicity) and urban/rural form. We have made significant progress in generating this data architecture. More specifically, we have obtained health data from the Centers for Medicare & Medicaid Services (CMS). However, while considerable progress has been made in developing data processing pipelines for CMS claims data, CMS data privacy and confidentiality restrictions hinder the sharing of preprocessed datasets, leading to duplication of data cleaning and processing efforts. This duplication is wasteful, as researchers may not be able to build on each other's work or collaborate effectively. While the data cannot be shared, sharing open processing pipelines is crucial to eliminate duplicated effort and make AI/ML-ready data more readily available. When workflows are shared, containers are critical to ensure reproducibility.
With this administrative supplement, our goal is to adopt containerized data processing tasks to enhance the deployment of AI/ML pipelines for CMS data in the parent R01 and the wider research community. The use of containers enables the easy exchange or updating of single components in a processing pipeline, which can be reused across AI/ML pipelines shared by different investigators in the study team and, more broadly, across research institutions. The adoption of containerized data processing tasks enhances the reproducibility of our parent R01 and allows for the optimization of computational resources on High-Performance Computing (HPC) systems. The container-based AI/ML pipelines will accelerate research in the parent R01, allowing us to rigorously estimate the disparities in the associations between exposure to air pollution and weather and health outcomes. Furthermore, these improvements are crucial to allow for the dissemination of the workflow pipelines across the wider research community.

Key facts

NIH application ID: 10842665
Project number: 3R01MD016054-02S2
Recipient: YALE UNIVERSITY
Principal Investigator: Michelle L Bell
Activity code: R01
Funding institute: NIH
Fiscal year: 2023
Award amount: $343,780
Award type: 3
Project period: 2022-07-24 → 2024-03-31