Statistical learning (SL) is the study of the generalizable extraction of knowledge from data (Friedman et al. 2001). Learning is used when human expertise does not exist, when humans are unable to explain their expertise, when the solution changes over time, or when the solution needs to be adapted to particular cases. The principal algorithms used in SL are classified as: (i) supervised learning (e.g., regression and classification), which is trained on labelled examples, i.e., inputs for which the desired output is known; in other words, a supervised learning algorithm attempts to generalize a function or mapping from inputs to outputs, which can then be used to predict an output for previously unseen inputs; (ii) unsupervised learning (e.g., association and clustering), which operates on unlabelled examples, i.e., inputs for which the desired output is unknown; here the objective is to discover structure in the data (e.g., through a cluster analysis), not to generalize a mapping from inputs to outputs; (iii) semi-supervised learning, which combines both labelled and unlabelled examples to generate an appropriate function or classifier. In a multidimensional context, when the number of variables is very large, or when some of them are believed to contribute little to identifying the group structure in the data set, researchers apply a continuous model for dimensionality reduction, such as principal component analysis, factor analysis, or correspondence analysis, and then a discrete clustering model, such as K-means or mixture models, on the computed object scores. This approach is called tandem analysis (TA) by Arabie & Hubert (1994). However, De Sarbo et al. (1990) and De Soete & Carroll (1994) warn against this approach, because the dimension reduction methods may identify dimensions that contribute little to revealing the group structure in the data and that, on the contrary, may obscure or mask any group structure that exists in the data.
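The tandem analysis described above can be sketched in a few lines. The following is a minimal illustration, not the implementation used in the thesis: it assumes PCA as the reduction step and plain Lloyd iterations for K-means, and the function names (`pca_scores`, `kmeans`, `tandem_analysis`) and the toy data are hypothetical, introduced here only for illustration.

```python
import numpy as np


def pca_scores(X, n_components):
    """Project the centered data onto its leading principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def kmeans(Y, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns one cluster label per row of Y."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centroids = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every row to every centroid.
        dist = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        updated = centroids.copy()
        for c in range(k):
            members = Y[labels == c]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                updated[c] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels


def tandem_analysis(X, n_components, k, seed=0):
    """Step 1: reduce with PCA. Step 2: cluster the object scores."""
    return kmeans(pca_scores(X, n_components), k, seed=seed)


if __name__ == "__main__":
    # Hypothetical toy data: the groups differ only on a low-variance
    # variable, while a second variable is high-variance noise.
    rng = np.random.default_rng(1)
    group = np.repeat([0, 1], 50)
    informative = np.where(group == 0, -1.0, 1.0) + rng.normal(0, 0.2, 100)
    noise = rng.normal(0, 10.0, 100)
    X = np.column_stack([informative, noise])
    labels = tandem_analysis(X, n_components=1, k=2)
    # Cluster labels are arbitrary, so score both possible matchings.
    agreement = max(np.mean(labels == group), np.mean(labels != group))
    print(f"agreement with the true groups: {agreement:.2f}")
```

In the toy data the first principal component is expected to align mostly with the noisy variable, because that variable carries most of the variance. Clustering the one-dimensional scores is therefore likely to miss the groups, which is the kind of masking the cited warnings describe.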
A solution to this problem is given by a methodology that detects factors and clusters simultaneously. For continuous data, many alternative methods combining cluster analysis with the search for a reduced set of factors have been proposed, focusing on factorial methods, multidimensional scaling or unfolding analysis, and clustering (e.g., Heiser 1993, De Soete & Heiser 1993). De Soete & Carroll (1994) proposed an alternative to the K-means procedure, named reduced K-means (RKM), which turned out to be equivalent to the previously proposed projection pursuit clustering (PPC) (Bolton & Krzanowski 2012). RKM simultaneously searches for a clustering of the objects, based on the K-means criterion (MacQueen 1967), and a dimensionality reduction of the variables, based on principal component analysis (PCA). However, this approach may fail to recover the clustering of the objects when the data contain much variance in directions orthogonal to the subspace in which the clusters reside (Timmerman et al. 2010). To solve this problem, Vichi & Kiers (2001) proposed the factorial K-means (FKM) model. FKM combines K-means cluster analysis with PCA, thereby finding the subspace that best represents the clustering structure in the data. In other words, FKM works in the reduced space and simultaneously searches for the best partition of the objects, based on the K-means criterion, and the best reduced orthogonal space in which to represent it, based on PCA. When categorical variables are observed, TA corresponds to applying multiple correspondence analysis (MCA) first and then K-means clustering on the resulting factors. Hwang et al. (2007) proposed an extension of MCA that takes into account cluster-level heterogeneity in respondents' preferences/choices. The method combines MCA and K-means in a unified framework: the former is used to uncover a low-dimensional space of multivariate categorical variables, while the latter is used to identify relatively homogeneous clusters of respondents.
In recent years, the dimensionality reduction problem has also become well known in other statistical contexts, such as structural equation modeling (SEM). In a wide range of SEM applications, the assumption that the data are collected from a single homogeneous population is often unrealistic, and the identification of different groups (clusters) of observations is a critical issue in many fields. Following this research idea, in this doctoral thesis we review the most recent statistical models used to solve the dimensionality problem discussed above. In particular, in the first chapter we show an application to hyperspectral data classification using the most widely used discriminant functions for the high-dimensionality problem, e.g., partial least squares discriminant analysis (PLS-DA); in the second chapter we present the multiple correspondence K-means (MCKM) model proposed by Fordellone & Vichi (2017), which simultaneously identifies the best partition of the N objects, described by the best orthogonal linear combination of categorical variables, according to a single objective function; finally, in the third chapter we present the partial least squares structural equation modeling K-means (PLS-SEM-KM) model proposed by Fordellone & Vichi (2018), which simultaneously identifies the best partition of the N objects, described by the best causal relationships among the latent constructs.
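The reduced and factorial K-means criteria mentioned above can be written compactly. The notation is an assumption introduced here for illustration and does not come from the abstract: X is the N x J centered data matrix, U the N x K binary membership matrix, A the J x Q column-wise orthonormal loading matrix, and \bar{Y} the K x Q matrix of centroids in the reduced space.

```latex
\begin{align}
\text{RKM:}\quad & \min_{U,\,A,\,\bar{Y}} \; \lVert X - U\bar{Y}A^{\top} \rVert^{2}
  && \text{subject to } A^{\top}A = I_{Q}, \\
\text{FKM:}\quad & \min_{U,\,A,\,\bar{Y}} \; \lVert XA - U\bar{Y} \rVert^{2}
  && \text{subject to } A^{\top}A = I_{Q}, \\
\text{relation:}\quad & \lVert X - U\bar{Y}A^{\top} \rVert^{2}
  = \lVert X - XAA^{\top} \rVert^{2} + \lVert XA - U\bar{Y} \rVert^{2}.
\end{align}
```

The last identity follows from (X - XAA^T)A = 0 when A^T A = I_Q, which makes the cross term vanish. Under this notation the two criteria differ only by the term measuring the part of X that lies outside the subspace spanned by A: RKM is penalized for leaving variance outside the subspace, whereas FKM evaluates the partition only within it.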
Dimensionality reduction and simultaneous classification approaches for complex data: methods and applications / Fordellone, Mario. (2019 Sep 24).
Dimensionality reduction and simultaneous classification approaches for complex data: methods and applications
FORDELLONE, MARIO
24/09/2019

File: Tesi_dottorato_Fordellone.pdf
Access: open access
Type: Doctoral thesis
License: All rights reserved
Size: 4.16 MB
Format: Adobe PDF