Given a set of data already grouped into classes, the problem of predicting whose class each new data belongs to is often referred to as classification problem. The first set of data is generally called training set, while the second one test set. Classification problems are of fundamental relevance in the fields of data analysis, data mining, etc., and are moreover able to represent several other interesting practical problems. As a consequence, many classification models and algorithms have been proposed in the literature. However, a substantial tradeoff between classification accuracy and computational efficiency exists. We propose here an approach having a very reduced computational burden. Such approach appears therefore suitable for very large data-sets, in particular for those which are so large that cannot be tackled by other approaches (e.g. database streaming). Data records may be composed by both qualitative and quantitative fields. The proposed approach is based on the selection of a set of values for each field. The classification capability of each of such values is computed on the basis of information directly extractable from the training set. Without a priori assumptions on the meaning of the data-set, except that it represents some real-world phenomenon (either physical or sociological or economical, etc.), we carry out a general statistical evaluation, and specialize it to the cases of numerical fields having normal (Gaussian) distribution or binomial (Bernoulli) distribution. The so computed classification capabilities are used for modeling the choice of a tractable number of such values as a binary knapsack problem with unitary weighs. Such version of the problem can be solved in polynomial-time. The selected values are then used for performing a discretization of the data-set and for its classification. Results of the proposed procedure on several data-sets, including some of the UCI repository, are presented and discussed.
Real-time Classification of Large Data Sets using Binary Knapsack / Bruni, Renato. - STAMPA. - (2004). (Intervento presentato al convegno annual conference AIRO tenutosi a Lecce).
Real-time Classification of Large Data Sets using Binary Knapsack
BRUNI, Renato
2004
Abstract
Given a set of data already grouped into classes, the problem of predicting whose class each new data belongs to is often referred to as classification problem. The first set of data is generally called training set, while the second one test set. Classification problems are of fundamental relevance in the fields of data analysis, data mining, etc., and are moreover able to represent several other interesting practical problems. As a consequence, many classification models and algorithms have been proposed in the literature. However, a substantial tradeoff between classification accuracy and computational efficiency exists. We propose here an approach having a very reduced computational burden. Such approach appears therefore suitable for very large data-sets, in particular for those which are so large that cannot be tackled by other approaches (e.g. database streaming). Data records may be composed by both qualitative and quantitative fields. The proposed approach is based on the selection of a set of values for each field. The classification capability of each of such values is computed on the basis of information directly extractable from the training set. Without a priori assumptions on the meaning of the data-set, except that it represents some real-world phenomenon (either physical or sociological or economical, etc.), we carry out a general statistical evaluation, and specialize it to the cases of numerical fields having normal (Gaussian) distribution or binomial (Bernoulli) distribution. The so computed classification capabilities are used for modeling the choice of a tractable number of such values as a binary knapsack problem with unitary weighs. Such version of the problem can be solved in polynomial-time. The selected values are then used for performing a discretization of the data-set and for its classification. Results of the proposed procedure on several data-sets, including some of the UCI repository, are presented and discussed.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.