Automatic detection and correction of errors in large data sets is a highly relevant task in many applications. A core operation in such a process is the information reconstruction phase. It consists of selecting the values that should be put in some of the fields of an erroneous record in order to restore its correctness [1]. Such values are often obtained by means of the so-called data-driven approach, i.e. they are taken from correct records called donors. In particular, for each erroneous record, a set of donors that are as similar as possible to it should be selected [2]. When dealing with large data sets, the donor selection operations are very time-consuming. A frequently adopted solution is to stop the donor search performed for each erroneous record before the whole set of donors has been scanned, using some stopping criterion. However, maximum similarity is not guaranteed in this case, and so the quality of the information reconstruction degrades. We therefore propose an innovative approach to the donor selection problem, with the aim of reducing the number of donors that must be examined while maintaining good information reconstruction quality. This is obtained by clustering both the set of donors and the set of erroneous records beforehand, i.e. partitioning them into collections of subsets so that elements of the same subset are similar [3,4]. In order to deal with massive population data sets, a new clustering algorithm, called the algorithm of the spherical neighbourhoods, is proposed. Additional techniques have been developed to cluster the set of erroneous records, since their clustering should ignore their errors. After this, the search for donors is conducted, for each erroneous record, by examining only the cluster(s) containing the donors that are most similar to it. This procedure significantly reduces computation times, while producing very good information reconstruction quality.
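The idea of restricting the donor search to the most similar cluster(s) can be illustrated with a minimal Python sketch. It is not the algorithm of the spherical neighbourhoods, whose details are not given in this abstract: a simple greedy radius-based ("leader") clustering is swapped in as a stand-in. The function names, the Euclidean distance, the `radius` parameter and the `trusted` list of field indices assumed to be error-free are all hypothetical choices made for illustration only.

```python
from math import dist


def leader_clusters(donors, radius):
    """Greedy stand-in clustering (an assumption, not the paper's method):
    a donor joins the first cluster whose centre lies within `radius`,
    otherwise it becomes the centre of a new cluster."""
    clusters = []  # list of (centre, members) pairs
    for d in donors:
        for centre, members in clusters:
            if dist(centre, d) <= radius:
                members.append(d)
                break
        else:
            clusters.append((d, [d]))
    return clusters


def masked_dist(a, b, trusted):
    """Euclidean distance over the trusted field indices only,
    so that erroneous fields do not influence similarity."""
    return sum((a[i] - b[i]) ** 2 for i in trusted) ** 0.5


def select_donors(record, trusted, clusters, k=1, n_clusters=1):
    """Return the k most similar donors, scanning only the
    n_clusters clusters whose centres are closest to the record."""
    nearest = sorted(
        clusters, key=lambda c: masked_dist(record, c[0], trusted)
    )[:n_clusters]
    candidates = [d for _, members in nearest for d in members]
    return sorted(candidates, key=lambda d: masked_dist(record, d, trusted))[:k]


# Hypothetical two-field records; field 1 of the erroneous record is wrong.
donors = [(20, 1.0), (22, 1.2), (21, 0.9), (60, 5.0), (62, 5.5)]
clusters = leader_clusters(donors, radius=5.0)
print(select_donors((63, -1.0), trusted=[0], clusters=clusters))
# [(62, 5.5)]
```

In this toy run only the two donors of the nearest cluster are compared against the erroneous record, rather than all five. Raising `n_clusters` trades speed for a higher chance of finding the globally most similar donor.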
Clustering for improving Information Reconstruction / Bruni, Renato. - PRINT. - (2006). (Paper presented at the AIRO annual conference, held in Cesena, Italy, in 2006).
Clustering for improving Information Reconstruction
BRUNI, Renato
2006