Machine Learning Pipeline for Early Diabetes Detection: A Comparative Study with Explainable AI / Barzegar, Yas; Barzegar, Atrin; Bellini, Francesco; D'Ascenzo, Fabrizio; Gorelova, Irina; Pisani, Patrizio. - In: FUTURE INTERNET. - ISSN 1999-5903. - 17:11(2025). [10.3390/fi17110513]

Machine Learning Pipeline for Early Diabetes Detection: A Comparative Study with Explainable AI

Barzegar, Yas; Barzegar, Atrin; Bellini, Francesco; D'Ascenzo, Fabrizio; Gorelova, Irina; Pisani, Patrizio
2025

Abstract

The use of Artificial Intelligence (AI) in healthcare has significantly advanced early disease detection, enabling timely diagnosis and improved patient outcomes. This work proposes an end-to-end, data-quality-focused machine learning (ML) pipeline for predicting diabetes. Its key steps are advanced preprocessing with KNN imputation, intelligent feature selection, class-imbalance handling with the hybrid SMOTEENN resampling approach, and multi-model classification. We rigorously compared nine ML classifiers for diabetes prediction, including ensemble approaches (Random Forest, CatBoost, XGBoost), Support Vector Machines (SVM), and Logistic Regression (LR). We evaluated performance on specificity, accuracy, recall, precision, and F1-score to assess generalizability and robustness. We employed SHapley Additive exPlanations (SHAP) for explainability, using it to rank and identify the most influential clinical risk factors. Results indicate that the ensemble models outperformed the other classifiers; CatBoost performed best, achieving an ROC-AUC of 0.972, an accuracy of 96.9%, and an F1-score of 0.972. This data-driven design provides a reproducible platform for applying useful and interpretable ML models in clinical practice, with future internet-of-things-based smart healthcare systems as a primary application.
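
The five label-based metrics named in the abstract (specificity, accuracy, recall, precision, F1-score) all derive from the four confusion-matrix counts. As a minimal sketch, not the authors' code, the Python below computes them from paired true and predicted labels. The function name `binary_metrics`, the `BinaryMetrics` container, and the sample labels are hypothetical and serve only as illustration. ROC-AUC is left out because it requires predicted scores rather than hard labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    specificity: float
    f1: float


def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when the denominator is zero
    # (for example, when no sample was predicted positive).
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred, positive=1):
    """Compute five label-based metrics for a binary classifier."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))

    # Confusion-matrix counts with `positive` as the class of interest.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # also called sensitivity
    return BinaryMetrics(
        accuracy=_ratio(tp + tn, len(pairs)),
        precision=precision,
        recall=recall,
        specificity=_ratio(tn, tn + fp),
        f1=_ratio(2 * precision * recall, precision + recall),
    )


# Hypothetical labels (1 = diabetic, 0 = non-diabetic), for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting specificity alongside recall matters in a screening setting: recall measures how many true diabetic cases are caught, while specificity measures how many non-diabetic cases are correctly cleared.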
AI; Smart Healthcare; ML; Diagnosis; Hybrid Resampling; Interpretability; Feature Selection; Future Internet
01 Journal publication::01a Journal article
Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1755154