Access to clean and safe drinking water is essential for public health and sustainable development. This study presents a robust machine learning methodology for predicting water potability from real water quality parameters. The methodology combines intelligent feature selection with a comparative evaluation of multiple classifiers. As a novelty, the study employs a hybrid resampling technique, SMOTEENN, which combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbors (ENN) to improve class balance: SMOTE oversamples the minority class, and ENN removes noisy samples. Experimental results on a public water quality dataset show strong, balanced performance across multiple evaluation metrics, with most models achieving high scores. In addition, visualizations such as Receiver Operating Characteristic (ROC) curves and feature importance plots support interpretability and offer insight into model behaviour. The main contribution of this paper is a cost-effective and scalable solution for smart water quality monitoring and decision support in public health and environmental safety systems; the proposed approach performs better than the current state of the art.
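The hybrid resampling idea named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation, which this record does not describe: the function names (`knn_indices`, `smote`, `enn`, `smoteenn`), the toy data, and the choice of `k=3` are all assumptions made for illustration. The sketch handles only the binary case and uses a majority-vote rule in the ENN step, which is a simplification. In practice one would more likely use the `SMOTEENN` class from the imbalanced-learn library.

```python
import math
import random


def knn_indices(points, query, k, exclude=None):
    """Return indices of the k points nearest to query (Euclidean distance)."""
    order = sorted(
        (i for i in range(len(points)) if i != exclude),
        key=lambda i: math.dist(points[i], query),
    )
    return order[:k]


def smote(X, y, minority, k=3, rng=None):
    """Oversample the minority class until both classes have equal size.

    Each synthetic point is a random interpolation between a minority
    sample and one of its k nearest minority neighbours (binary case only).
    """
    rng = rng or random.Random(0)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(y) - len(minority_pts)) - len(minority_pts)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        i = rng.randrange(len(minority_pts))
        j = rng.choice(knn_indices(minority_pts, minority_pts[i], k, exclude=i))
        gap = rng.random()
        base, neigh = minority_pts[i], minority_pts[j]
        X_new.append(tuple(b + gap * (n - b) for b, n in zip(base, neigh)))
        y_new.append(minority)
    return X_new, y_new


def enn(X, y, k=3):
    """Keep only samples whose label matches most of their k nearest neighbours."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        neigh = knn_indices(X, x, k, exclude=i)
        agree = sum(1 for j in neigh if y[j] == label)
        if agree * 2 > len(neigh):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smoteenn(X, y, minority, k=3, seed=0):
    """Apply SMOTE oversampling, then ENN cleaning, to a binary dataset."""
    X_over, y_over = smote(X, y, minority, k, random.Random(seed))
    return enn(X_over, y_over, k)


# Hypothetical toy data: label 0 is the majority class, label 1 the minority.
# The last majority point sits inside the minority cluster and acts as noise.
X_demo = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (0.8, 0.2),
          (5.1, 5.1),
          (5, 5), (5, 6), (6, 5), (6, 6)]
y_demo = [0] * 8 + [1] * 4

X_res, y_res = smoteenn(X_demo, y_demo, minority=1)
print("class counts after resampling:", y_res.count(0), y_res.count(1))
```

On this toy data the oversampling step brings the minority class up to the size of the majority class, and the cleaning step then drops the mislabeled-looking majority point that lies inside the minority cluster. Real water quality data would need feature scaling before any distance-based step of this kind.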
Data-Centric Water Safety Monitoring: A Machine Learning Pipeline with Intelligent Feature Selection for Potability Prediction / Barzegar, Yas; Barzegar, Atrin; Bellini, Francesco; Marrone, Stefano; Pisani, Patrizio; Verde, Laura. - In: PROCEDIA COMPUTER SCIENCE. - ISSN 1877-0509. - (2025).
Data-Centric Water Safety Monitoring: A Machine Learning Pipeline with Intelligent Feature Selection for Potability Prediction
Yas Barzegar; Francesco Bellini
2025


