Large-scale data integration and harmonization to accurately predict sites facing future health-based drinking water crises

NIH RePORTER · NIH · R43 · $256,579

Abstract

Project summary: Up to 45 million people per year in the U.S. are directly impacted by health-based drinking water problems. These problems lead to at least 16 million cases of acute gastroenteritis directly linked to pollution at community water systems, with tens of millions more people directly impacted by chemical and organic pollutants. Impacts are further exacerbated in locations dealing with water scarcity, in under-served populations, and within other vulnerable populations already suffering from health disparities. Many of these water problems are the direct result of managerial negligence, inconsistent monitoring, and an inability to anticipate where problems may arise next. While the reasons for drinking water problems are complex, anticipating where health-based drinking water problems will occur could have an immediate and positive impact on tens of millions of Americans annually. Extensive data about water quality and the performance of municipal water systems already exist in large, disparate databases. These databases are largely ignored and, when used, are typically used only anecdotally and retroactively. Preliminary evidence suggests that these existing databases, which contain histories of administrative violations and sub-threshold water-quality results, can be mined to accurately predict future drinking water crises. The Superior Statistical Research R&D team is an internationally recognized group of water experts with cross-cutting expertise in statistics, data analysis, modelling, and computing; in water-quality monitoring of biological and chemical contaminants; and in translating water-quality and health information clearly and compellingly into actionable steps for individuals, organizations, and communities.
In this Phase I project, we will show that it is possible to predict water-related, health-based problem areas using already collected historical data on water quality and municipal water system performance. We will begin by harmonizing the disparate water quality and municipal water system performance data in two states (Michigan and Iowa). We will then use machine-learning techniques to predict health-based violation histories and will evaluate our methods by comparing predicted violations to actual health-based violations in the previous 5 years. Finally, we will identify at least 10 municipalities determined by our algorithm to be at the highest risk for future health-based water problems and will conduct systematic sampling to confirm our model-based predictions. We will then demonstrate how these predictions can be made profitable by exploring how they can be presented to customers in an economical, usable form. Proof of our concept and profitability models in two states (Phase I) will set us up for widespread (multi-state) database harmonization and improvement of the proposed machine-learning/modelling effort in Phase II...
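
As an illustration only, the Phase I workflow (harmonize two state datasets, score each water system, flag the highest-risk systems, and compare the flags against observed health-based violations) could be sketched roughly as below. Everything specific here is an assumption and not taken from the project: the per-state column names in `FIELD_MAPS`, the `SystemRecord` schema, the sample rows, and the weights are hypothetical. The abstract does not specify the machine-learning model, so a simple weighted count of administrative violations and sub-threshold results stands in for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemRecord:
    """Common schema for one water system (hypothetical, for illustration)."""
    system_id: str
    state: str
    admin_violations: int       # administrative violations in the earlier window
    sub_threshold_results: int  # results near, but not over, a regulatory limit
    health_violation: bool      # health-based violation observed in the held-out window


# Hypothetical per-state column names; real state databases will differ.
FIELD_MAPS = {
    "MI": {"id": "pwsid", "admin": "admin_viol_ct",
           "sub": "near_limit_ct", "hv": "hb_viol"},
    "IA": {"id": "SystemID", "admin": "AdminViolations",
           "sub": "SubThreshold", "hv": "HealthViolation"},
}


def harmonize(state, rows):
    """Map one state's raw rows (dicts) onto the common SystemRecord schema."""
    m = FIELD_MAPS[state]
    return [
        SystemRecord(
            system_id=f"{state}-{row[m['id']]}",
            state=state,
            admin_violations=int(row[m["admin"]]),
            sub_threshold_results=int(row[m["sub"]]),
            health_violation=bool(int(row[m["hv"]])),
        )
        for row in rows
    ]


def risk_score(rec, w_admin=1.0, w_sub=0.5):
    """Baseline score standing in for a trained model; weights are arbitrary."""
    return w_admin * rec.admin_violations + w_sub * rec.sub_threshold_results


def top_k(records, k):
    """Return the k systems with the highest risk score."""
    return sorted(records, key=risk_score, reverse=True)[:k]


def precision_at_k(records, k):
    """Fraction of the k flagged systems that had an observed health-based violation."""
    flagged = top_k(records, k)
    if not flagged:
        return 0.0
    return sum(r.health_violation for r in flagged) / len(flagged)


# Made-up example rows, one list per state, each in that state's own layout.
mi_rows = [
    {"pwsid": "001", "admin_viol_ct": "4", "near_limit_ct": "6", "hb_viol": "1"},
    {"pwsid": "002", "admin_viol_ct": "0", "near_limit_ct": "1", "hb_viol": "0"},
]
ia_rows = [
    {"SystemID": "A7", "AdminViolations": 3, "SubThreshold": 2, "HealthViolation": 1},
    {"SystemID": "B9", "AdminViolations": 1, "SubThreshold": 0, "HealthViolation": 0},
]

records = harmonize("MI", mi_rows) + harmonize("IA", ia_rows)
for rec in top_k(records, 2):
    print(rec.system_id, risk_score(rec))
print("precision at 2:", precision_at_k(records, 2))
```

In a fuller treatment, the score would come from a model trained only on data preceding the 5-year evaluation window, so that the comparison against observed violations is a genuine out-of-sample check.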

Key facts

NIH application ID: 10253600
Project number: 1R43ES033134-01
Recipient: SUPERIOR STATISTICAL RESEARCH, LLC
Principal Investigator: Nathan L Tintle
Activity code: R43
Funding institute: NIH
Fiscal year: 2021
Award amount: $256,579
Award type: 1
Project period: 2021-04-01 → 2022-09-30