Record linkage is a class of statistical and algorithmic methods that aim to identify whether two or more observed records refer to the same statistical entity. Duplications of the same entity within a single source or across different files may be interpreted as “clusters of records” showing strong similarities across their fields. In this paper we frame the record linkage process as a formal Bayesian clustering model and investigate the role of species sampling models as the natural prior specification for the clustering structure. Indeed, the different latent entities that produce the records observed in one or more data sources can be effectively treated as the sampled species, and the observed records as noisy measurements of their features. We also discuss an important issue in the clustering approach to entity resolution, namely the need to bound the cluster sizes even in the presence of large data sets.
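As a rough illustration of the clustering view described in the abstract, the sketch below draws a random partition of records into latent entities from a Chinese restaurant process, which is one well-known example of a species sampling model. The abstract does not say which species sampling prior the authors adopt, so the choice of process, the function name `sample_crp_partition`, and the parameter `theta` are all assumptions made for illustration only.

```python
import random
from collections import Counter


def sample_crp_partition(n_records, theta, seed=0):
    """Draw entity labels for n_records from a Chinese restaurant process.

    theta > 0 is the concentration parameter. Both the process and the
    parameter value are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    labels = []  # labels[i] = index of the latent entity behind record i
    sizes = []   # sizes[k] = number of records currently linked to entity k
    for i in range(n_records):
        # Record i joins existing entity k with probability sizes[k] / (i + theta)
        # and reveals a new entity ("species") with probability theta / (i + theta).
        weights = sizes + [theta]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        labels.append(k)
    return labels, sizes


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        _, sizes = sample_crp_partition(n, theta=50.0, seed=1)
        print(n, "records:", len(sizes), "entities, largest cluster =", max(sizes))
```

Under this particular prior the largest cluster tends to grow as more records are added, whereas in record linkage an entity is usually expected to generate only a handful of records. This is one way to read the abstract's remark about bounding cluster sizes even for large data sets.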

Bayesian Nonparametric Methods for Record linkage / Liseo, Brunero; Tancredi, Andrea. - Electronic. - (2016). (Paper presented at the 48th SIS Scientific Meeting of the Italian Statistical Society, held in Salerno in June 2016).

Bayesian Nonparametric Methods for Record linkage

Liseo, Brunero; Tancredi, Andrea
2016

48th SIS Scientific Meeting of the Italian Statistical Society
Species Sampling; Latent structure; Mixture Models
04 Publication in conference proceedings::04b Conference paper in volume

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/978122