Data Mining for Error Correction of Real World Massive Data Sets / Bruni, Renato. - STAMPA. - (2003). (Paper presented at the AIRO annual conference held in Venezia).
Data Mining for Error Correction of Real World Massive Data Sets
BRUNI, Renato
2003
Abstract
In a variety of relevant real-world problems, tasks of "data mining" and "knowledge discovery" are required. An important example is the correction of erroneous data in very large data sets. This should generally be performed by exploiting, as much as possible, the correct information contained in the data [1]. To reach this aim, a general approach consists in introducing the minimum number of changes in the erroneous records while preserving, as far as possible, the original frequency distributions of the data [2]. The correction process is based on information extracted from the data set itself, a methodology often referred to as a "data-driven" approach. Such concepts are formalized into mathematical models of the correction problem. In particular, two alternative models are analyzed: the first is based on propositional logic, while the second relies on mixed integer linear programming. Techniques for automatically encoding the general case, as well as for efficiently handling difficult special cases, are presented. Algorithms based on branching techniques for solving the above-described models are discussed. The procedure is applied to data sets of different origins. A first collection of data sets representing human populations is considered [3]. A new family of data sets representing television transmitters is also considered. Results are extremely encouraging.
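To make the minimum-change idea concrete, the sketch below formulates a toy correction problem as a small mixed integer linear program in Python with the PuLP library. The record fields (age, married), the single edit rule ("married implies age >= 15"), and the big-M linking constraints are illustrative assumptions introduced here, not the model from the paper, and the sketch does not include the preservation of frequency distributions discussed in the abstract.

from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, value

# Toy erroneous record: a 12-year-old recorded as married,
# violating the assumed edit rule "married implies age >= 15".
original = {"age": 12, "married": 1}

prob = LpProblem("minimum_change_correction", LpMinimize)

# Corrected values for each field.
age = LpVariable("age", lowBound=0, upBound=120, cat="Integer")
married = LpVariable("married", cat=LpBinary)

# Binary indicators: 1 if the corresponding field is changed.
change_age = LpVariable("change_age", cat=LpBinary)
change_married = LpVariable("change_married", cat=LpBinary)

# Objective: change as few fields as possible.
prob += change_age + change_married

# Big-M constraints linking corrected values to the change indicators.
M = 120
prob += age - original["age"] <= M * change_age
prob += original["age"] - age <= M * change_age
prob += married - original["married"] <= change_married
prob += original["married"] - married <= change_married

# Edit rule, linearized: if married == 1 then age >= 15.
prob += age >= 15 * married

prob.solve()
print("corrected record:", {"age": value(age), "married": value(married)})

Running this with the bundled CBC solver returns a corrected record that satisfies the edit rule at the cost of a single changed field (either the age is raised or the marital status is cleared, both being optimal here).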