MULTIPLE CORRESPONDENCE ANALYSIS AND K-MEANS: A NEW APPROACH FOR SIMULTANEOUS DIMENSION REDUCTION AND CLUSTERING / Fordellone, Mario (Sapienza University of Rome); Vichi, Maurizio (Sapienza University of Rome). - Electronic. - (2016). (Paper presented at the conference DATA SCIENCE AND SOCIAL RESEARCH, held in Naples, 17-19 February 2016).
MULTIPLE CORRESPONDENCE ANALYSIS AND K-MEANS: A NEW APPROACH FOR SIMULTANEOUS DIMENSION REDUCTION AND CLUSTERING
VICHI, Maurizio;
2016
Abstract
Correspondence Analysis and Multiple Correspondence Analysis (MCA) (Benzécri, 1973; Greenacre, 1984) are descriptive, exploratory techniques for analysing the statistical relations between categorical variables in simple two-way and multi-way contingency tables. Their results are similar in nature to those produced by factorial techniques such as Principal Component Analysis (PCA) and Factor Analysis (FA), which reduce the dimension of the variable space. Clustering algorithms are traditionally viewed as unsupervised methods of data analysis that aim to group the observed units into homogeneous clusters according to some notion of similarity. K-means (MacQueen, 1967) is the most commonly used algorithm for automatically partitioning a data set into K groups. Frequently, when big data are observed, or when the researcher wishes to simplify the interpretation of the observed data, a factorial method is applied first and a clustering algorithm second. This sequential procedure, known as tandem analysis (Arabie and Hubert, 1994), applies a clustering algorithm to the latent scores on the first few components. However, many authors (DeSarbo et al., 1990; De Soete and Carroll, 1994) warn against this approach, because the traditional factorial methods (e.g. PCA, MCA, FA) may identify dimensions that contribute little to revealing the clustering structure in the data. Reduced K-means (De Soete and Carroll, 1994) and Factorial K-means (Vichi and Kiers, 2001) are two methodologies that simultaneously identify the best partition of the objects and the factors that best explain that partition when continuous variables are observed. In the present work, a new methodology for simultaneous clustering and dimensionality reduction is proposed for categorical data.
A simulation study and an application on real data are included to evaluate the performance of the proposed methodology.
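For orientation, the sequential (tandem) procedure that the abstract warns against can be sketched for categorical data as MCA followed by K-means on the first few row scores. This is a minimal illustration and not the simultaneous methodology proposed in the work. The function names, the toy data, and the deterministic farthest-first initialisation (used here in place of random starts so that the result is reproducible) are all assumptions introduced for the example.

```python
import numpy as np

def indicator_matrix(data):
    """One-hot code each categorical column and stack the blocks side by side."""
    blocks = []
    for j in range(data.shape[1]):
        levels, codes = np.unique(data[:, j], return_inverse=True)
        blocks.append(np.eye(len(levels))[codes])
    return np.hstack(blocks)

def mca_row_scores(Z, n_dims):
    """Row principal coordinates from correspondence analysis of the indicator matrix."""
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardised residuals
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :n_dims] * sv[:n_dims]) / np.sqrt(r)[:, None]

def farthest_first_centers(X, k):
    """Deterministic start (assumption): row 0, then the row farthest from chosen centers."""
    idx = [0]
    for _ in range(1, k):
        d = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    return X[idx].copy()

def kmeans(X, k, n_iter=100):
    """Plain Lloyd iterations: assign rows to the nearest center, then update centers."""
    centers = farthest_first_centers(X, k)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == g].mean(axis=0) if np.any(labels == g) else centers[g]
            for g in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy data: six units observed on three categorical variables.
data = np.array([
    ["a", "x", "low"],
    ["a", "x", "low"],
    ["a", "y", "low"],
    ["b", "y", "high"],
    ["b", "z", "high"],
    ["b", "z", "high"],
])

Z = indicator_matrix(data)               # step 1: complete disjunctive coding
scores = mca_row_scores(Z, n_dims=2)     # step 2: dimension reduction by MCA
labels, centers = kmeans(scores, k=2)    # step 3: clustering on the reduced scores
print(labels)
```

Because the two steps optimise different criteria, the retained MCA dimensions are chosen without reference to the partition. That is the limitation that motivates estimating both jointly.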