# Statistical methods for air-pollution studies using low-cost monitors

> **NIH R01** · JOHNS HOPKINS UNIVERSITY · 2024 · $331,594

## Abstract

Air pollution research is increasingly adopting emergent cost-effective technologies to measure pollutant levels at spatial and temporal scales finer than those delivered by the geographically sparse network of regulatory monitors. Low-cost air-pollution monitors, while promising, introduce a series of data features such as the need for field co-location and calibration to eliminate noise, massive spatio-temporally correlated datasets, and repeated measures on exposures. Current statistical methodology for more traditional air-pollution data collection schemes is not optimized to properly exploit the noisy, high-throughput, and spatio-temporally dependent low-cost data. This proposal pursues multi-faceted statistical methods development motivated by the unique features of the low-cost monitoring data to improve the rigor and widen the breadth of scientific findings based on such data.

Our first innovation is a spatial-filtering method for calibration of the noisy low-cost data. Regression calibration of low-cost networks using field co-location with regulatory monitors leads to underestimation of air-pollution peaks, a critical flaw from a health perspective. The current practice also fails to exploit the spatial correlation among exposure levels in the network. Our proposed filtering approach mitigates both issues and will be used to produce network-wide calibrated and smooth high-resolution spatio-temporal maps of pollutants.
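
The peak-underestimation effect of regression calibration can be illustrated with a small simulation. This is a minimal sketch under assumed values: the gamma-distributed "true" concentrations, the sensor bias and noise level, and all variable names are hypothetical and do not come from the project. It shows only the baseline problem, not the proposed spatial-filtering method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical co-location data: reference ("true") levels and a noisy low-cost reading.
n = 5000
truth = rng.gamma(shape=4.0, scale=5.0, size=n)           # right-skewed pollutant levels
low_cost = 2.0 + 1.3 * truth + rng.normal(0.0, 8.0, n)     # biased, noisy sensor (assumed model)

# Regression calibration: regress the reference on the low-cost reading,
# then map every reading through the fitted line.
slope, intercept = np.polyfit(low_cost, truth, deg=1)
calibrated = intercept + slope * low_cost

# Compare the highest 5% of true concentrations with their calibrated values.
peak = truth >= np.quantile(truth, 0.95)
print(f"mean true level at peaks:       {truth[peak].mean():.1f}")
print(f"mean calibrated level at peaks: {calibrated[peak].mean():.1f}")
print(f"variance ratio (calibrated / truth): {calibrated.var() / truth.var():.2f}")
```

Because the fitted line estimates a conditional mean, the calibrated values are pulled toward the overall average, so the calibrated series has less variance than the truth and its values at the highest true concentrations are too low.
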

Our next set of innovations concerns proper utilization of the high-throughput data from low-cost networks. The large low-cost datasets have increased uptake of data-intensive machine-learning (ML) methods like random forests (RF) for exposure prediction modeling. However, exposure data are spatio-temporally correlated, and RF encounters numerous issues with dependent data, leading to loss of accuracy. We proposed RF-GLS, a novel extension of RF that explicitly accounts for spatio-temporal correlation to improve predictions. We will develop extensions of RF-GLS for use in the spatial filtering, for predicting categorical exposure data (like Air Quality Index category), and for estimating exposure effects after accounting for confounders. We will use RF-GLS for predicting personal exposures using the low-cost ambient and wearable network data in Baltimore.
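
One way to read the "GLS" in RF-GLS is that the squared-error criterion used to choose tree splits is replaced by a generalized least-squares criterion weighted by a working precision matrix. The sketch below shows that idea for a single split of one covariate. It is a simplified illustration, not the authors' implementation: the exponential covariance, its parameters, the simulated data, and the helper names `gls_split_loss` and `best_split` are all assumptions.

```python
import numpy as np

def gls_split_loss(x, y, cutoff, Q):
    """GLS loss of splitting x at `cutoff`, with working precision matrix Q."""
    # Membership matrix: column 0 = left child, column 1 = right child.
    Z = np.column_stack([x <= cutoff, x > cutoff]).astype(float)
    # GLS estimates of the two child-node values.
    beta = np.linalg.solve(Z.T @ Q @ Z, Z.T @ Q @ y)
    resid = y - Z @ beta
    return float(resid @ Q @ resid)

def best_split(x, y, Q):
    """Return the candidate cutoff that minimizes the GLS split loss."""
    xs = np.unique(x)                        # sorted unique covariate values
    candidates = (xs[:-1] + xs[1:]) / 2.0    # midpoints keep both children non-empty
    losses = [gls_split_loss(x, y, c, Q) for c in candidates]
    return candidates[int(np.argmin(losses))]

rng = np.random.default_rng(1)
n = 200
s = np.sort(rng.uniform(0.0, 10.0, n))       # hypothetical 1-D monitor locations
x = rng.uniform(0.0, 1.0, n)                 # hypothetical covariate

# Assumed exponential spatial covariance plus a small nugget (not estimated from data).
Sigma = np.exp(-np.abs(s[:, None] - s[None, :]) / 2.0) + 0.1 * np.eye(n)
errors = np.linalg.cholesky(Sigma) @ rng.normal(size=n)
y = np.where(x > 0.5, 2.0, 0.0) + errors     # step signal at x = 0.5

Q = np.linalg.inv(Sigma)
cut_gls = best_split(x, y, Q)                # dependence-aware split
cut_ols = best_split(x, y, np.eye(n))        # Q = I gives the usual squared-error criterion
print(f"GLS split: {cut_gls:.3f}, OLS split: {cut_ols:.3f}")
```

Setting `Q` to the identity matrix reduces the loss to the within-node sum of squares used by an ordinary regression tree, which makes the two criteria easy to compare on the same data.
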

We recognize that the rich repeated measures data on exposures from low-cost monitors can be directly used in association studies between health and air pollution without any ad hoc and lossy data reduction like using the mean exposure. We propose a scalar-on-distribution-analysis (SoDA) that uses the entire sample of exposures as a distribution-valued covariate in association studies. SoDA is tailored to repeated measures covariates and will be more efficient than the general-purpose SoFR (scalar-on-function-regression). SoDA will be used to directly assess which aspects of an individual's exposure distribution correlate most with their health, which in tur...
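
The information lost by reducing repeated exposures to their mean can be seen in a toy comparison. The sketch below is not SoDA: it swaps in ordinary least squares on a handful of empirical quantiles as a crude stand-in for a distribution-valued covariate. The lognormal exposure model, the tail-driven outcome, and the helper `r_squared` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_obs = 300, 200

# Hypothetical repeated exposure measurements: subjects differ in level and variability.
level = rng.uniform(5.0, 15.0, n_subjects)
spread = rng.uniform(0.2, 1.0, n_subjects)
noise = rng.normal(size=(n_subjects, n_obs))
exposures = level[:, None] * np.exp(spread[:, None] * noise)

# Assumed outcome: driven by the upper tail (95th percentile) of each subject's exposures.
health = 0.1 * np.quantile(exposures, 0.95, axis=1) + rng.normal(0.0, 0.5, n_subjects)

def r_squared(features, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()

# Lossy summary: one number per subject.
mean_only = exposures.mean(axis=1)

# Richer summary: ten empirical quantiles per subject, shape (n_subjects, 10).
probs = np.linspace(0.05, 0.95, 10)
quantile_profile = np.quantile(exposures, probs, axis=1).T

print(f"R^2, mean exposure only: {r_squared(mean_only, health):.2f}")
print(f"R^2, quantile profile:   {r_squared(quantile_profile, health):.2f}")
```

In this setup two subjects can share a similar mean exposure while differing sharply in their upper tail, so the mean-only fit explains less of the outcome than the fit that sees the shape of each subject's exposure distribution.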

## Key facts

- **NIH application ID:** 10740909
- **Project number:** 5R01ES033739-03
- **Recipient organization:** JOHNS HOPKINS UNIVERSITY
- **Principal Investigator:** Abhirup Datta
- **Activity code:** R01
- **Funding institute:** NIH
- **Fiscal year:** 2024
- **Award amount:** $331,594
- **Award type:** 5
- **Project period:** 2022-02-10 → 2026-11-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10740909

## Citation

> US National Institutes of Health, RePORTER application 10740909, Statistical methods for air-pollution studies using low-cost monitors (5R01ES033739-03). Retrieved via AI Analytics 2026-05-22 from https://api.ai-analytics.org/grant/nih/10740909. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
