Object Matching (OM) is the problem of identifying pairs of data-objects that come from different sources and represent the same real-world object. Several methods have been proposed to solve OM problems, but none of them seems to be both fully automated and highly effective. In this paper we present a fundamentally new suite of methods that possesses both of these abilities. We adopt a statistical approach based on mixture models, which structures an OM process into two consecutive tasks. First, the mixture parameters are estimated by fitting the model to the observed distance measures between pairs. Then, a probabilistic clustering of the pairs into Matches and Unmatches is obtained by exploiting the fitted model. In particular, we use a mixture model whose component densities belong to the Beta parametric family, and we fit it by means of an original perturbation-like technique. Moreover, we solve the clustering problem according to both Maximum Likelihood and Minimum Cost objectives. To accomplish this task, optimal decision rules fulfilling one-to-one matching constraints are searched for by a purposefully designed evolutionary algorithm. Notably, our suite of methods is distance-independent, in the sense that it does not rely on any restrictive assumption about the function used to compare data-objects. Even more interestingly, our approach is not confined to record linkage applications but can also be applied to match other kinds of data-objects. We present several experiments on real data that validate the proposed methods and show their excellent effectiveness. © 2010 IEEE.
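
The two-task structure described in the abstract (fit a two-component Beta mixture to pair distances, then classify pairs with the fitted model) can be illustrated with a minimal sketch. This is not the authors' method: the paper fits the mixture with an original perturbation-like technique and searches for one-to-one decision rules with an evolutionary algorithm, whereas the sketch below substitutes a plain EM loop with a method-of-moments M-step and scores each pair independently. The function names, the starting values, and the assumption that distances are normalized to (0, 1) with Matches concentrated at small distances are all illustrative choices, not taken from the paper.

```python
import math
import random


def beta_pdf(x, a, b):
    """Beta(a, b) density at x in (0, 1), computed through log-gamma."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def moments_to_beta(mean, var):
    """Beta parameters whose mean and variance equal the given moments."""
    mean = min(max(mean, 1e-3), 1 - 1e-3)
    # A Beta variance must lie below mean * (1 - mean); clamp to stay valid.
    var = min(max(var, 1e-4), 0.99 * mean * (1 - mean))
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


def weighted_beta(xs, ws):
    """Weighted method-of-moments estimate (an approximate M-step)."""
    total = max(sum(ws), 1e-12)
    mean = sum(w * x for w, x in zip(ws, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / total
    return moments_to_beta(mean, var)


def match_probability(x, p, params_m, params_u, eps=1e-3):
    """Posterior probability that a pair at distance x is a Match."""
    x = min(max(x, eps), 1 - eps)
    m = p * beta_pdf(x, *params_m)
    u = (1 - p) * beta_pdf(x, *params_u)
    return m / (m + u) if m + u > 0 else 0.5


def fit_beta_mixture(distances, n_iter=100, eps=1e-3):
    """EM for p * Beta(params_m) + (1 - p) * Beta(params_u).

    Returns (p, params_m, params_u), where p is the estimated share of Matches.
    """
    xs = [min(max(d, eps), 1 - eps) for d in distances]
    # Assumed starting point: Matches near 0, Unmatches near 1.
    p, params_m, params_u = 0.5, (2.0, 5.0), (5.0, 2.0)
    for _ in range(n_iter):
        # E-step: responsibility of the Match component for each pair.
        resp = [match_probability(x, p, params_m, params_u, eps) for x in xs]
        # M-step: update the mixing weight and both Beta components.
        p = min(max(sum(resp) / len(xs), 1e-6), 1 - 1e-6)
        params_m = weighted_beta(xs, resp)
        params_u = weighted_beta(xs, [1 - r for r in resp])
    return p, params_m, params_u


if __name__ == "__main__":
    # Synthetic distances for illustration only, not data from the paper.
    random.seed(0)
    data = [random.betavariate(2, 8) for _ in range(300)]
    data += [random.betavariate(8, 2) for _ in range(700)]
    p, params_m, params_u = fit_beta_mixture(data)
    print("estimated share of Matches:", round(p, 3))
    print("P(Match | distance = 0.15):", round(match_probability(0.15, p, params_m, params_u), 3))
```

Thresholding `match_probability` at 0.5 gives a maximum-posterior rule for each pair taken in isolation. A minimum-cost rule would instead weight the two error types by user-supplied costs, and neither rule in this sketch enforces the one-to-one matching constraint mentioned in the abstract.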

Effective automated object matching / Zardetto, Diego; Scannapieco, Monica; Catarci, Tiziana. - (2010), pp. 757-768. (Paper presented at the 26th IEEE International Conference on Data Engineering, ICDE 2010, held in Long Beach, United States, 1 March 2010 through 6 March 2010) [10.1109/icde.2010.5447904].

Effective automated object matching

CATARCI, Tiziana
2010

2010
26th IEEE International Conference on Data Engineering, ICDE 2010
Clustering problems; Component density; Distance measure
04 Conference proceedings publication::04b Conference paper in volume


Use this identifier to cite or link to this document: https://hdl.handle.net/11573/209493