Data Mining for Error Correction of Real World Massive Data Sets / Bruni, Renato. - PRINT. - (2003). (Contribution presented at the AIRO annual conference held in Venezia).

Data Mining for Error Correction of Real World Massive Data Sets

BRUNI, Renato
2003

Abstract

In a variety of relevant real-world problems, tasks of "data mining" and "knowledge discovery" are required. An important example is the correction of erroneous data within very large data sets. This should generally be performed by using, as much as possible, the correct information contained in the data themselves [1]. To reach this aim, a general approach consists in introducing the minimum changes in the erroneous records while preserving, as far as possible, the original frequency distributions of the data [2]. Since the correction process is based on information extracted from the data set itself, this methodology is often referred to as a "data-driven" approach. Such concepts are formalized into mathematical models of the correction problem. In particular, two alternative models are analyzed: the first is based on propositional logic, while the second relies on mixed integer linear programming. Techniques for an automatic encoding of the general case, as well as for an efficient handling of difficult special cases, are presented. Algorithms based on branching techniques for solving the above models are discussed. The procedure is applied to data sets of different origins: a first collection of data sets representing human populations [3], and a new family of data sets representing television transmitters. Results are extremely encouraging.
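
To make the approach concrete, the following is a minimal, purely illustrative sketch of minimum-change error correction against a set of edit rules. It is not the paper's formulation: the paper models the problem via propositional logic or mixed integer linear programming and solves it with branching techniques, and it also preserves the original frequency distributions, which this toy enumeration ignores. All field names, rules, and domains below are hypothetical.

    from itertools import combinations, product

    # Hypothetical edit rules (propositional-style consistency constraints).
    # In a data-driven setting, such rules and the candidate value domains
    # would be derived from the data set itself.
    RULES = [
        lambda r: not (r["marital_status"] == "married" and r["age"] < 14),
        lambda r: not (r["employed"] and r["age"] < 10),
        lambda r: r["age"] >= 0,
    ]

    DOMAINS = {
        "age": list(range(0, 100)),
        "marital_status": ["single", "married", "divorced", "widowed"],
        "employed": [True, False],
    }

    def satisfies_all(record):
        """True if the record violates none of the edit rules."""
        return all(rule(record) for rule in RULES)

    def minimum_change_correction(record):
        """Correct a record by changing as few fields as possible.

        Branches over which fields to change (by increasing number) and over
        candidate values in their domains; the first feasible record found is
        therefore a minimum-cardinality correction.
        """
        if satisfies_all(record):
            return record, []
        fields = list(DOMAINS)
        for k in range(1, len(fields) + 1):
            for changed in combinations(fields, k):
                for values in product(*(DOMAINS[f] for f in changed)):
                    candidate = dict(record)
                    candidate.update(dict(zip(changed, values)))
                    if satisfies_all(candidate):
                        return candidate, list(changed)
        return None, None  # no feasible correction within the given domains

    if __name__ == "__main__":
        erroneous = {"age": 5, "marital_status": "married", "employed": True}
        corrected, changed = minimum_change_correction(erroneous)
        print("corrected record:", corrected)
        print("fields changed:  ", changed)

In this toy run, the erroneous record violates two rules and is repaired by changing only the age field, which is a minimum-change correction within the assumed domains.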

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/498850