
Novel methods for Intrinsic dimension estimation and manifold learning / Lanteri, Alessandro. - (2016 Mar 18).

Novel methods for Intrinsic dimension estimation and manifold learning

LANTERI, ALESSANDRO
18/03/2016

Abstract

One of the most challenging problems in modern science is how to deal with the huge amount of data that today's technologies provide. Several difficulties may arise. For instance, the number of samples may be too large, or incoming data may stream in faster than the algorithms that must process them. Another common problem is that as the data dimension grows, so does the volume of the space, leading to a sparsification of the available data. This can hamper statistical analysis, since the amount of data needed to support a conclusion often grows exponentially with the dimension. This problem is commonly referred to as the Curse of Dimensionality, and it is one of the reasons why high dimensional data cannot be analyzed efficiently with traditional methods. Classical methods for dimensionality reduction, such as principal component analysis and factor analysis, may fail due to a nonlinear structure in the data. In recent years, several methods for nonlinear dimensionality reduction have been proposed.

A general way to model high dimensional data sets is to represent the observations as noisy samples drawn from a probability distribution μ in the real coordinate space of D dimensions. It has been observed that the essential support of μ can often be well approximated by low dimensional sets. These sets can be assumed to be low dimensional manifolds embedded in the D-dimensional ambient space. A manifold is a topological space which globally may not be Euclidean, but which in a small neighborhood of each point behaves like a Euclidean space. In this setting we call the dimension of the manifold the intrinsic dimension; it is usually much lower than the ambient dimension D. Roughly speaking, the intrinsic dimension of a data set can be described as the minimum number of variables needed to represent the data without significant loss of information.

In this work we propose different methods aimed at estimating the intrinsic dimension. The first method we present models the neighbors of each point as stochastic processes, in such a way that a closed form likelihood function can be written. This leads to a closed form maximum likelihood estimator (MLE) for the intrinsic dimension, which enjoys the desirable properties of maximum likelihood estimators. The second method is based on a multiscale singular value decomposition (MSVD) of the data: it performs singular value decomposition (SVD) on neighborhoods of increasing size and estimates the intrinsic dimension by studying the behavior of the singular values as the neighborhood radius increases. We also introduce an algorithm to estimate the model parameters when the data are assumed to be sampled around an unknown number of planes with different intrinsic dimensions, embedded in a high dimensional space. Models of this kind have many applications in computer vision and pattern recognition, where the data can be described by multiple linear structures or need to be clustered into groups that can be represented by low dimensional hyperplanes. The algorithm relies on both MSVD and spectral clustering, and it is able to estimate the number of planes, their dimensions, and their arrangement in the ambient space. Finally, we propose a novel method for manifold reconstruction based on a multiscale approach, which approximates the manifold from coarse to fine scales with increasing precision.

The basic idea is to produce, at a generic scale j, a piecewise linear approximation of the manifold using a collection of low dimensional planes, and to use those planes to create clusters for the data. At scale j + 1, each cluster is independently approximated by another collection of low dimensional planes, and the process is iterated until the desired precision is achieved. This algorithm is fast because it is highly parallelizable, and its computational time is independent of the sample size. Moreover, the method automatically constructs a tree structure for the data, which can be particularly useful in applications that require an a priori tree structure. The aim of the collection of methods proposed in this work is to provide algorithms to learn and estimate the underlying structure of high dimensional data sets. (Illustrative code sketches of the main ideas follow below.)
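The abstract does not spell out the thesis's closed form MLE, so the sketch below implements the well-known Levina-Bickel estimator instead, as a flavor of what a likelihood-based intrinsic dimension estimator with a closed form looks like: nearest-neighbor distances are modeled as arising from a Poisson process, and maximizing the likelihood yields a local estimate from log-ratios of neighbor distances. It is offered only as an analogue of the first method, not as the thesis's estimator.

    import numpy as np

    def mle_intrinsic_dim(X, k=10):
        # Levina-Bickel MLE: for each point, the closed-form local estimate
        # is the inverse mean of log(T_k / T_j), j = 1..k-1, where T_j is
        # the distance to the j-th nearest neighbor.
        # Brute-force pairwise distances (use a KD-tree for large n).
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        dist.sort(axis=1)
        T = dist[:, 1:k + 1]              # T_1..T_k, skipping self-distance 0
        local = 1.0 / np.log(T[:, -1:] / T[:, :-1]).mean(axis=1)
        return local.mean()               # aggregate by averaging over points

    # Toy check: points on a 2-sphere embedded in R^5; estimate is close to 2.
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(1000, 3))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    X = np.hstack([Y, np.zeros((1000, 2))])
    print(mle_intrinsic_dim(X, k=10))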
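The core idea of the second method, tracking the singular values of neighborhoods of growing radius, can be sketched as follows. Singular values along tangent directions grow roughly linearly with the radius, while noise and curvature directions lag behind, so a gap in the spectrum reveals the intrinsic dimension. The gap rule and the aggregation across scales used here are simple placeholders, not the thesis's actual multiscale criterion.

    import numpy as np

    def msvd_dim(X, center_idx, radii):
        # Multiscale SVD sketch: singular values of growing neighborhoods
        # around a reference point, with a per-scale spectral-gap estimate.
        dists = np.linalg.norm(X - X[center_idx], axis=1)
        dims = []
        for r in radii:
            nbhd = X[dists <= r]
            if len(nbhd) < 3:
                continue
            s = np.linalg.svd(nbhd - nbhd.mean(axis=0), compute_uv=False)
            s = s[s > 1e-12]
            if len(s) < 2:
                continue
            # Illustrative gap rule: dimension = position of the sharpest
            # drop in the singular value spectrum at this scale.
            dims.append(int(np.argmax(s[:-1] / s[1:])) + 1)
        # The thesis studies the full multiscale profile; here we simply
        # report the most frequent per-scale estimate.
        return int(np.bincount(dims).argmax()) if dims else None

    # Toy check: a noisy 2-plane embedded in R^10; estimate is ~2.
    rng = np.random.default_rng(1)
    X = np.zeros((2000, 10))
    X[:, :2] = rng.uniform(-1, 1, (2000, 2))
    X += 0.01 * rng.normal(size=X.shape)
    print(msvd_dim(X, center_idx=0, radii=np.linspace(0.2, 1.0, 8)))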
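For the multiscale reconstruction, a minimal sketch of the coarse-to-fine recursion is given below, assuming a known plane dimension d (the thesis estimates dimensions from the data) and using a median split along the leading principal direction as a stand-in for the plane-based clustering step described in the abstract.

    import numpy as np

    def build_tree(X, d, tol, depth=0, max_depth=8):
        # Fit a d-dimensional affine plane to the cluster via SVD; if the
        # mean residual exceeds tol, split the cluster and refine each
        # child independently at the next (finer) scale.
        mean = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
        basis = Vt[:d]                                  # local tangent plane
        resid = (X - mean) - (X - mean) @ basis.T @ basis
        err = np.linalg.norm(resid, axis=1).mean()
        node = {"mean": mean, "basis": basis, "err": err, "children": []}
        if err > tol and depth < max_depth and len(X) > 2 * (d + 1):
            proj = (X - mean) @ Vt[0]
            left = proj <= np.median(proj)
            if 0 < left.sum() < len(X):                 # both halves nonempty
                for part in (X[left], X[~left]):
                    node["children"].append(
                        build_tree(part, d, tol, depth + 1, max_depth))
        return node

    # Toy usage: approximate a noisy circle in R^3 with 1-dim segments.
    t = np.linspace(0, 2 * np.pi, 500, endpoint=False)
    rng = np.random.default_rng(2)
    C = np.c_[np.cos(t), np.sin(t), np.zeros_like(t)]
    C += 0.005 * rng.normal(size=(500, 3))
    tree = build_tree(C, d=1, tol=0.02)

Because each child cluster is refined independently, the recursion parallelizes naturally, and the nested clusters form exactly the kind of tree structure for the data that the abstract mentions.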
Attached file: Tesi dottorato Lanteri (doctoral thesis, open access) — Adobe PDF, 1.76 MB. License: Creative Commons.
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/905425