Effective automated object matching / Zardetto, Diego; Scannapieco, Monica; Catarci, Tiziana. - (2010), pp. 757-768. (Paper presented at the 26th IEEE International Conference on Data Engineering, ICDE 2010, held in Long Beach, United States, 1 March 2010 through 6 March 2010) [10.1109/icde.2010.5447904].
Effective automated object matching
CATARCI, Tiziana
2010
Abstract
Object Matching (OM) is the problem of identifying pairs of data-objects coming from different sources and representing the same real-world object. Several methods have been proposed to solve OM problems, but none of them seems to be both fully automated and highly effective. In this paper we present a fundamentally new suite of methods that possesses both properties. We adopt a statistical approach based on mixture models, which structures an OM process into two consecutive tasks. First, mixture parameters are estimated by fitting the model to observed distance measures between pairs. Then, a probabilistic clustering of the pairs into Matches and Unmatches is obtained by exploiting the fitted model. In particular, we use a mixture model with component densities belonging to the Beta parametric family and we fit it by means of an original perturbation-like technique. Moreover, we solve the clustering problem according to both Maximum Likelihood and Minimum Cost objectives. To accomplish this task, optimal decision rules fulfilling one-to-one matching constraints are searched for by a purposefully designed evolutionary algorithm. Notably, our suite of methods is distance-independent, in the sense that it does not rely on any restrictive assumption about the function used to compare data-objects. Even more interestingly, our approach is not confined to record linkage applications but can also be applied to match other kinds of data-objects. We present several experiments on real data that validate the proposed methods and show their excellent effectiveness. © 2010 IEEE.
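As a rough illustration of the second task described in the abstract (clustering pairs once the mixture has been fitted), the sketch below scores each pair by its posterior probability of being a Match under a two-component Beta mixture and applies a simple maximum-posterior rule. It is only a sketch under stated assumptions: the parameter values and the function names (`beta_pdf`, `match_posterior`, `classify`) are hypothetical, distances are assumed to lie strictly between 0 and 1, and neither the paper's perturbation-like fitting technique, nor its Minimum Cost rule, nor the evolutionary search for one-to-one matchings is reproduced here.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def match_posterior(d, p_match, match_ab, unmatch_ab):
    """Posterior probability that a pair at distance d is a Match."""
    m = p_match * beta_pdf(d, *match_ab)
    u = (1 - p_match) * beta_pdf(d, *unmatch_ab)
    return m / (m + u)


def classify(distances, p_match, match_ab, unmatch_ab):
    """Label each pair 'M' or 'U' according to the more probable component."""
    return [
        "M" if match_posterior(d, p_match, match_ab, unmatch_ab) > 0.5 else "U"
        for d in distances
    ]


# Hypothetical parameters (not taken from the paper): Matches concentrate
# near distance 0, Unmatches near distance 1, and 10% of pairs are Matches.
P_MATCH = 0.1
MATCH_AB = (1.5, 10.0)
UNMATCH_AB = (8.0, 2.0)

labels = classify([0.05, 0.12, 0.60, 0.90], P_MATCH, MATCH_AB, UNMATCH_AB)
print(labels)
```

With these assumed parameters, pairs at small distances are labelled as Matches and pairs at large distances as Unmatches. Each pair is decided independently, so nothing prevents one object from being matched to several others; enforcing one-to-one constraints is the part the abstract assigns to the evolutionary algorithm.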
File | Access | Type | License | Size | Format
---|---|---|---|---|---
VE_2010_11573-209493.pdf | Archive managers only | Publisher's version (published version with the publisher's layout) | All rights reserved | 261.42 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.