FedArtML: A Tool to Facilitate the Generation of Non-IID Datasets in a Controlled Way to Support Federated Learning Research / Jimenez, Daniel; Anagnostopoulos, Aris; Chatzigiannakis, Ioannis; Vitaletti, Andrea. - In: IEEE ACCESS. - ISSN 2169-3536. - 12:(2024), pp. 81004-81016. [10.1109/access.2024.3410026]

FedArtML: A Tool to Facilitate the Generation of Non-IID Datasets in a Controlled Way to Support Federated Learning Research

Jimenez, Daniel (first author); Anagnostopoulos, Aris; Chatzigiannakis, Ioannis; Vitaletti, Andrea
2024

Abstract

Federated Learning (FL) enables collaborative training of Machine Learning (ML) models across decentralized clients while preserving data privacy. One challenge FL faces arises when the clients’ data is not independent and identically distributed (non-IID). It is therefore crucial to quantify how non-IID data impacts performance. However, because few federated datasets are available, it is not easy to carry out real-world simulations. In this work, we (1) propose, for the first time, the Hist-Dirichlet-based and Min-Size-Dirichlet methods, which partition data across multiple nodes using the feature and quantity distributions together with the Dirichlet distribution; (2) use the Jensen-Shannon and Hellinger distances to quantify how far the clients’ data departs from IID; and (3) implement state-of-the-art partitioning methods based on the distribution of labels across clients. All our proposals are open source in a library called FedArtML, publicly available on PyPI. It facilitates research on cross-silo and cross-device FL by allowing a systematic and controlled partition of centralized datasets according to label, feature, and quantity skewness. To demonstrate the value of our proposed methods and the robustness of FedArtML, we ran experiments on ECG arrhythmia detection with the Physionet 2020 data. Our results demonstrate that our tool generates federated datasets for multi-client model training and accurately measures the heterogeneity of client distributions. Our approach achieves 48% higher non-IID-ness than existing feature skew methods, providing more granularity. Furthermore, we validate our simulated federated datasets against real-world data, revealing only a 2% F1-Score difference, which affirms the method’s real-life applicability.
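As a rough illustration of the ideas in the abstract, the following is a minimal sketch of a label-based Dirichlet partition together with the Hellinger and Jensen-Shannon distances. It is not FedArtML's API and not the Hist-Dirichlet or Min-Size-Dirichlet method: every function name here is hypothetical, and summarizing heterogeneity as the mean distance between each client's label distribution and the global one is an assumption made only for this example.

```python
import numpy as np


def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients, class by class.

    For each class, client shares are drawn from Dirichlet(alpha, ..., alpha).
    Small alpha gives skewed shares; large alpha gives near-equal shares.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    parts = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        shares = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn cumulative shares into split points inside this class.
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, chunk in enumerate(np.split(idx, cuts)):
            parts[client].extend(chunk.tolist())
    return parts


def label_distribution(labels, indices, classes):
    """Empirical label distribution of the samples selected by `indices`."""
    subset = np.asarray(labels)[np.asarray(indices, dtype=int)]
    counts = np.array([(subset == c).sum() for c in classes], dtype=float)
    return counts / counts.sum()


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 divergence), in [0, 1]."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))


def mean_distance_to_global(labels, parts, distance):
    """Assumed summary: average distance of each non-empty client to the global distribution."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    global_dist = label_distribution(labels, np.arange(len(labels)), classes)
    dists = [
        distance(label_distribution(labels, idx, classes), global_dist)
        for idx in parts
        if len(idx) > 0
    ]
    return float(np.mean(dists))


if __name__ == "__main__":
    toy_labels = np.repeat(np.arange(4), 1500)  # 4 balanced classes, synthetic
    for alpha in (0.1, 1.0, 1000.0):
        parts = dirichlet_label_partition(toy_labels, num_clients=5, alpha=alpha, seed=1)
        print(
            alpha,
            mean_distance_to_global(toy_labels, parts, hellinger_distance),
            mean_distance_to_global(toy_labels, parts, js_distance),
        )
```

In this sketch the Dirichlet concentration parameter `alpha` is the single knob that controls label skew, and both distances are bounded in [0, 1], which makes heterogeneity levels comparable across partitions.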
Centralized datasets; client’s heterogeneity; federated datasets; federated learning; heterogeneity metrics; machine learning; non-IID-ness
01 Journal publication::01a Journal article
Files attached to this item
File: Gutierrez_FedArtML_2024.pdf
Access: open access
Note: https://doi.org/10.1109/access.2024.3410026
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 1.58 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1711428