MULTIPLE CORRESPONDENCE ANALYSIS AND K-MEANS: A NEW APPROACH FOR SIMULTANEOUS DIMENSION REDUCTION AND CLUSTERING / Fordellone, Mario (Sapienza University of Rome); Vichi, Maurizio (Sapienza University of Rome). - ELECTRONIC. - (2016). (Paper presented at the conference DATA SCIENCE AND SOCIAL RESEARCH, held in Naples, 17-19 February 2016).

MULTIPLE CORRESPONDENCE ANALYSIS AND K-MEANS: A NEW APPROACH FOR SIMULTANEOUS DIMENSION REDUCTION AND CLUSTERING

VICHI, Maurizio;
2016

Abstract

Correspondence Analysis and Multiple Correspondence Analysis (MCA) (Benzécri, 1973; Greenacre, 1984) are descriptive, exploratory techniques for analysing statistical relations between categorical variables in simple two-way and multi-way contingency tables. Their results are similar in nature to those produced by factorial techniques such as Principal Component Analysis (PCA) and Factor Analysis (FA), which reduce the dimensionality of the variables. Clustering algorithms are traditionally viewed as unsupervised methods of data analysis that group the observed units into homogeneous clusters according to some notion of similarity. K-means (MacQueen, 1967) is the most commonly used algorithm for automatically partitioning a data set into K groups. Frequently, when big data are observed, or when the researcher wishes to simplify the interpretation of the observed data, a factorial method is applied first and a clustering algorithm is applied afterwards. This sequential procedure, known as tandem analysis (Arabie and Hubert, 1994), corresponds to applying a clustering algorithm to the latent scores of the first few components. However, many authors (DeSarbo et al., 1990; De Soete and Carroll, 1994) warn against this approach, because traditional factorial methods (e.g. PCA, MCA, FA) may identify dimensions that contribute little to revealing the clustering structure in the data. Reduced K-means (De Soete and Carroll, 1994) and Factorial K-means (Vichi and Kiers, 2001) are two methodologies that simultaneously identify the best partition of the objects and the factors that best explain that partition when continuous variables are observed. In the present work, a new methodology for simultaneous clustering and dimensionality reduction is proposed for categorical data. A simulation study and an application to real data are included to evaluate the performance of the proposed methodology.
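
To make the sequential procedure concrete, the following is a minimal sketch of tandem analysis for categorical data: MCA row scores are computed first, and K-means is then run on the first few of them. It illustrates the two-step approach that the abstract cautions against, not the simultaneous methodology proposed in the paper. All function names, the integer coding of the categories, and the plain Lloyd-style K-means loop are assumptions made for illustration; only NumPy is used.

```python
import numpy as np

def indicator_matrix(X):
    """One-hot encode an (n, p) integer-coded categorical array (hypothetical helper)."""
    blocks = []
    for j in range(X.shape[1]):
        levels = np.unique(X[:, j])
        blocks.append((X[:, j][:, None] == levels[None, :]).astype(float))
    return np.hstack(blocks)

def mca_row_scores(Z, n_components):
    """Row principal coordinates from correspondence analysis of the indicator matrix Z."""
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :n_components] * s[:n_components]) / np.sqrt(r)[:, None]

def kmeans(Y, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; an assumed, simple stand-in for any K-means routine."""
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Keep the old center if a cluster happens to be empty.
        new_centers = np.array([
            Y[labels == g].mean(axis=0) if np.any(labels == g) else centers[g]
            for g in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def tandem_mca_kmeans(X, n_components=2, k=3, seed=0):
    """Step 1: MCA scores. Step 2: K-means on those scores (tandem analysis)."""
    Z = indicator_matrix(X)
    Y = mca_row_scores(Z, n_components)
    labels, _ = kmeans(Y, k, seed=seed)
    return Y, labels
```

Calling `tandem_mca_kmeans(X, n_components=2, k=3)` on an integer-coded array `X` returns the retained scores and a cluster label per unit. Because the components are chosen without reference to the partition, the retained dimensions need not be the ones that separate the clusters, which is the weakness the simultaneous approaches aim to address.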

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/870494