Optimal number of clusters to rank a model-based index / BOTTAZZI SCHENONE, Mariaelena; Grimaccia, Elena; Vichi, Maurizio. - (2024). (Paper presented at the Conference of European Statistics Stakeholders (CESS) - 2022, held in Rome).
Optimal number of clusters to rank a model-based index
Mariaelena Bottazzi Schenone; Elena Grimaccia; Maurizio Vichi
2024
Abstract
This study proposes a new method to define the optimal number of groups in cluster analysis, for cases where the order of the clusters is relevant. In this work, k-means clustering is applied to a univariate index resulting from a Structural Equation Model (SEM). In contrast to the majority of conventional procedures for choosing the number of clusters, the new methodology looks for the greatest number of clearly distinct clusters rather than a more parsimonious one. This method enables the construction of a granular ranking of the units in clusters starting from an index, and it minimizes the information loss caused by clustering with a low number of groups. Indeed, the classification adds information to the mere ordering of units: it helps locate homogeneous groups of elements for which the index value can be considered the same. Namely, it helps identify units perceived as similar to each other, which should be treated as “ties” in the ranking since they have substantially the same index value. This methodology works well when the goal is to rank units in groups from “the best” to “the worst”, according to a particular measure. The number of clusters is chosen as the maximum number of significantly different clusters, according to the non-parametric Wilcoxon “rank-sum” test. Since the clusters are ordered, the test compares each cluster with the closest one. An “ad-hoc” algorithm is proposed to define the ideal number of clusters. In this paper, k-means clustering is applied to an index measuring air pollution across European urban areas. A clustering of cities by air pollution level is represented graphically. The results of the analysis provide essential information for developing locally tailored policies aimed at reducing air pollution in metropolitan areas.

File | Access | Note | Type | License | Size | Format |
---|---|---|---|---|---|---|
Optimal number of clusters to rank a model-based index.pdf | Open access | Article | Post-print document (version after peer review and accepted for publication) | All rights reserved | 717.16 kB | Adobe PDF |
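The selection rule summarized in the abstract (k-means on a univariate index, then keeping the largest number of clusters for which each cluster differs significantly from its neighbour under the Wilcoxon rank-sum test) can be illustrated with a minimal sketch. This is not the authors' “ad-hoc” algorithm, which the abstract does not specify. The function names `kmeans_1d`, `rank_sum_p`, and `max_distinct_clusters`, the quantile-based initialization, the scan over `k = 2..k_max`, the significance level `alpha = 0.05`, and the normal approximation to the test (with continuity correction, without tie correction or multiple-testing adjustment) are all assumptions made for illustration.

```python
import math
from statistics import mean


def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on 1-D data; returns non-empty clusters ordered low to high."""
    xs = sorted(values)
    n = len(xs)
    # Assumed deterministic start: centers at evenly spaced quantile positions.
    centers = [xs[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            # Assign each value to its nearest center.
            j = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[j].append(x)
        new = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return sorted((g for g in groups if g), key=mean)


def rank_sum_p(a, b):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation.

    Assumption: continuity correction is applied; no tie correction of the variance.
    """
    n1, n2 = len(a), len(b)
    # Mann-Whitney U form of the statistic: pairs with a < b count 1, ties 0.5.
    u = sum((x < y) + 0.5 * (x == y) for x in a for y in b)
    mu = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = max(abs(u - mu) - 0.5, 0.0) / sd
    return math.erfc(z / math.sqrt(2))


def max_distinct_clusters(values, k_max=10, alpha=0.05):
    """Largest k in 2..k_max whose adjacent clusters all differ significantly."""
    best = [sorted(values)]  # fall back to a single cluster
    for k in range(2, k_max + 1):
        clusters = kmeans_1d(values, k)
        if len(clusters) < k:
            break  # k-means could not populate k clusters
        # Clusters are ordered, so each one is tested only against its neighbour.
        pairs = zip(clusters, clusters[1:])
        if all(rank_sum_p(lo, hi) < alpha for lo, hi in pairs):
            best = clusters
    return best
```

Calling `max_distinct_clusters(index_values)` on a list of index values returns the clusters ordered from lowest to highest, so that units in the same cluster can be read as ties in the ranking. The normal approximation is rough for very small clusters, so an exact test may be preferable there.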
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.