# Large-scale data integration and harmonization to accurately predict sites facing future health-based drinking water crises

> **NIH R43** · SUPERIOR STATISTICAL RESEARCH, LLC · 2021 · $256,579

## Abstract

Project summary: Up to 45 million people per year in the U.S. are directly impacted by health-based drinking
water problems. This leads to at least 16 million cases of acute gastroenteritis directly linked to pollution at
community water systems, with tens of millions more directly impacted by chemical and organic pollutants.
Impacts are further exacerbated in locations dealing with water scarcity, in under-served populations, and
within other vulnerable populations already suffering from health disparities. Many of these water problems are
the direct result of managerial negligence, inconsistent monitoring, and a lack of the ability to anticipate where
problems may arise next. While the reasons for drinking water problems are complex, if we could anticipate
where health-based drinking water problems were to occur in the future, it could have an immediate
and positive impact on tens of millions of Americans annually. Interestingly, extensive data about water
quality and the performance of municipal water systems already exists in large, disparate databases. These
databases are largely ignored and, when used, are typically used only anecdotally and retroactively.
Preliminary evidence suggests that these existing databases, which contain histories of administrative
violations and sub-threshold water-quality results, can be mined to accurately predict future drinking water
crises. The Superior Statistical Research R&D team is an internationally recognized group of water experts
with cross-cutting expertise in statistics/data analysis/modelling/computing, water-quality monitoring of
biological and chemical contaminants, and the ability to clearly and compellingly translate water-quality and
health information to actionable steps for individuals, organizations and communities. In this Phase I project,
we will show that it is possible to predict water-related, health-based problem areas utilizing already collected,
historical data on water quality and municipal water system performance. We will begin by harmonizing the
disparate water quality and municipal water system performance data from two states (Michigan and Iowa).
We will then utilize machine-learning techniques to predict health-based violation histories and will evaluate our
methods by comparing predicted violations to actual health-based violations in the previous 5 years. Finally,
we will identify at least 10 municipalities determined by our algorithm to be at the highest risk for future
health-based water problems and will conduct systematic sampling to confirm our model-based predictions. We will then
demonstrate how these predictions can be leveraged for profitability by exploring how our model-based
predictions can be presented to customers in an economical, usable form. Proof of our concept and profitability
models in two states (Phase I) will set us up for widespread (multi-state) database harmonization and
improvement of the proposed machine-learning/modelling effort in Phase II...
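
The workflow described in the abstract (harmonize per-state records, score each water system, then compare the highest-risk systems against actual health-based violations) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the column names in `FIELD_MAPS`, the `SystemRecord` fields, the synthetic rows, and especially the weighted count in `risk_score`, which merely stands in for the machine-learning model that the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemRecord:
    """Common schema for one water system (hypothetical fields)."""
    system_id: str
    state: str
    admin_violations: int        # administrative violations in the history window
    near_threshold_results: int  # sub-threshold results close to a regulatory limit
    had_health_violation: bool   # actual health-based violation in the evaluation window


# Hypothetical per-state column names; real state databases will differ.
FIELD_MAPS = {
    "MI": {"id": "pwsid", "admin": "admin_viol_ct",
           "near": "near_limit_ct", "label": "hb_violation"},
    "IA": {"id": "SystemID", "admin": "AdminViolations",
           "near": "NearThreshold", "label": "HealthViolation"},
}


def harmonize(state, rows):
    """Map state-specific rows onto the common SystemRecord schema."""
    m = FIELD_MAPS[state]
    return [
        SystemRecord(
            system_id=f"{state}-{row[m['id']]}",
            state=state,
            admin_violations=int(row[m["admin"]]),
            near_threshold_results=int(row[m["near"]]),
            had_health_violation=bool(int(row[m["label"]])),
        )
        for row in rows
    ]


def risk_score(rec):
    """Placeholder score (assumed weights), not the project's actual model."""
    return 1.0 * rec.admin_violations + 2.0 * rec.near_threshold_results


def top_k(records, k):
    """Return the k systems with the highest risk score."""
    return sorted(records, key=risk_score, reverse=True)[:k]


def precision_at_k(records, k):
    """Fraction of the k highest-risk systems that had an actual violation."""
    flagged = top_k(records, k)
    if not flagged:
        return 0.0
    return sum(r.had_health_violation for r in flagged) / len(flagged)


# Synthetic rows for illustration only; these are not real monitoring data.
MI_ROWS = [
    {"pwsid": "001", "admin_viol_ct": "5", "near_limit_ct": "3", "hb_violation": "1"},
    {"pwsid": "002", "admin_viol_ct": "0", "near_limit_ct": "0", "hb_violation": "0"},
]
IA_ROWS = [
    {"SystemID": "A1", "AdminViolations": 2, "NearThreshold": 4, "HealthViolation": 1},
    {"SystemID": "A2", "AdminViolations": 1, "NearThreshold": 0, "HealthViolation": 0},
]

records = harmonize("MI", MI_ROWS) + harmonize("IA", IA_ROWS)
for rec in top_k(records, 2):
    print(rec.system_id, risk_score(rec))
print("precision@2:", precision_at_k(records, 2))
```

Precision among the top-ranked systems is only one way to compare predicted and actual violations; it is used here because the abstract describes selecting a short list of highest-risk municipalities for follow-up sampling.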

## Key facts

- **NIH application ID:** 10253600
- **Project number:** 1R43ES033134-01
- **Recipient organization:** SUPERIOR STATISTICAL RESEARCH, LLC
- **Principal Investigator:** Nathan L Tintle
- **Activity code:** R43
- **Funding institute:** NIH
- **Fiscal year:** 2021
- **Award amount:** $256,579
- **Award type:** 1
- **Project period:** 2021-04-01 → 2022-09-30

## Primary source

NIH RePORTER: https://reporter.nih.gov/project-details/10253600

## Citation

> US National Institutes of Health, RePORTER application 10253600, Large-scale data integration and harmonization to accurately predict sites facing future health-based drinking water crises (1R43ES033134-01). Retrieved via AI Analytics 2026-06-24 from https://api.ai-analytics.org/grant/nih/10253600. Licensed CC0.

---

*[NIH grants dataset](/datasets/nih-grants) · CC0 1.0*
