Optimal number of clusters to rank a model-based index / BOTTAZZI SCHENONE, Mariaelena; Grimaccia, Elena; Vichi, Maurizio. - (2024). (Paper presented at the Conference of European Statistics Stakeholders (CESS) - 2022, held in Rome).

Optimal number of clusters to rank a model-based index

Mariaelena Bottazzi Schenone; Elena Grimaccia; Maurizio Vichi
2024

Abstract

This study proposes a new method for determining the optimal number of groups in cluster analysis when the order of the clusters is relevant. K-means clustering is applied to a univariate index obtained from a Structural Equation Model (SEM). In contrast to most conventional procedures for choosing the number of clusters, the new methodology looks for the greatest number of clearly distinct clusters rather than the most parsimonious solution. This makes it possible to build a granular ranking of the units in clusters, starting from an index, and minimizes the information loss caused by clustering with a small number of groups. The classification adds information to the mere ordering of units: it helps locate homogeneous groups of elements for which the index value can be considered the same. In other words, it identifies units that are similar to each other and should be treated as “ties” in the ranking, since they have substantially the same index value. The methodology works well when the goal is to rank units in groups from “the best” to “the worst” according to a particular measure. The number of clusters is chosen as the maximum number of significantly different clusters according to the non-parametric Wilcoxon rank-sum test. Since the clusters are ordered, the test compares each cluster with the one closest to it. An ad hoc algorithm is proposed to determine the ideal number of clusters. In this paper, k-means clustering is applied to an index measuring air pollution across European urban areas, and the resulting clustering of cities by air pollution level is represented graphically. The results of the analysis provide essential information for developing locally tailored policies aimed at reducing air pollution in metropolitan areas.
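The abstract does not spell out the ad hoc algorithm, so the following is only a minimal sketch of the general idea it describes: cluster a univariate index with k-means for increasing k, test each cluster against its nearest neighbour with a Wilcoxon rank-sum test, and keep the largest k for which every adjacent pair differs significantly. The function names, the significance level `alpha=0.05`, the cap `k_max`, the deterministic initialisation, and the use of the normal approximation to the rank-sum test (without tie or multiple-testing corrections) are all assumptions made for illustration, not details taken from the paper.

```python
import math
from statistics import NormalDist, mean


def kmeans_1d(values, k, n_iter=100):
    """Lloyd's k-means on univariate data; returns non-empty clusters, lowest mean first."""
    xs = sorted(values)
    # Assumed deterministic start: centers at evenly spaced positions in the sorted data.
    centers = [xs[int((i + 0.5) * len(xs) / k)] for i in range(k)]
    clusters = [xs]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted((c for c in clusters if c), key=mean)


def rank_sum_pvalue(a, b):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    pooled = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # tied values share the average of ranks i+1..j
        rank_sum_a += avg_rank * sum(1 for _, group in pooled[i:j] if group == 0)
        i = j
    n1, n2 = len(a), len(b)
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_a - expected) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))


def max_distinct_clusters(values, alpha=0.05, k_max=10):
    """Return the partition with the largest k whose adjacent clusters all differ at level alpha."""
    best = [sorted(values)]
    for k in range(2, min(k_max, len(values)) + 1):
        clusters = kmeans_1d(values, k)
        if len(clusters) < k:
            continue  # k-means left a cluster empty; skip this k
        # Clusters are ordered, so each one is compared only with its neighbour.
        if all(rank_sum_pvalue(lo, hi) < alpha for lo, hi in zip(clusters, clusters[1:])):
            best = clusters
    return best


# Hypothetical index values, for illustration only.
index_values = [1, 2, 3, 101, 102, 103]
ranking = max_distinct_clusters(index_values, k_max=4)
for position, cluster in enumerate(ranking, start=1):
    print(position, cluster)
```

Units that fall in the same returned cluster would be treated as ties in the ranking. For real data, an exact or tie-corrected version of the test from a statistics library may be preferable to the normal approximation used here.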
Conference: Conference of European Statistics Stakeholders (CESS) - 2022
Keywords: Clusters ranking; Wilcoxon rank-sum test; multidimensional index; air pollution; metropolitan areas
Publication type: 04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item
File: Optimal number of clusters to rank a model-based index.pdf
Access: open access
Note: Article
Type: Post-print (version following peer review and accepted for publication)
License: All rights reserved
Size: 717.16 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1710416