Bayesian Nonparametric Methods for Record linkage / Liseo, Brunero; Tancredi, Andrea. - Electronic. - (2016). (Paper presented at the 48th SIS Scientific Meeting of the Italian Statistical Society, held in Salerno in June 2016).
Bayesian Nonparametric Methods for Record linkage
Liseo, Brunero; Tancredi, Andrea
2016
Abstract
Record linkage is a class of statistical and algorithmic methods that aim to identify whether two or more observed records refer to the same statistical entity. Duplications of the same entity within a single source or across different files may be interpreted as “clusters of records” showing strong similarities across their fields. In this paper we frame the record linkage process as a formal Bayesian clustering model and investigate the role of species sampling models as the natural prior specification for the clustering structure. In fact, the different latent entities that produce the records observed in one or more data sources can be effectively treated as the sampled species, and the observed records as noisy measurements of their features. We also discuss an important issue in the clustering approach to entity resolution, namely the need to bound cluster sizes even in the presence of large data sets.
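To make the abstract's idea of records as "sampled species" concrete, the sketch below draws a random partition of records from a Chinese restaurant process, which is the partition induced by the Dirichlet process and one of the simplest species sampling models. This is only an illustration under stated assumptions: the abstract does not say which species sampling prior the authors use, and the function name `sample_crp_partition` and the concentration parameter `theta` are hypothetical names introduced here.

```python
import random


def sample_crp_partition(n_records, theta, seed=0):
    """Draw cluster labels for n_records from a Chinese restaurant process.

    Assumption for illustration: each cluster stands for one latent entity,
    and theta > 0 is the concentration parameter of the process.
    """
    rng = random.Random(seed)
    labels = []  # labels[i] = cluster (latent entity) assigned to record i
    sizes = []   # sizes[k] = number of records currently in cluster k
    for i in range(n_records):
        # Record i joins existing cluster k with probability sizes[k] / (i + theta)
        # and opens a new cluster with probability theta / (i + theta).
        u = rng.random() * (i + theta)
        acc = 0.0
        chosen = len(sizes)  # default: open a new cluster
        for k, size in enumerate(sizes):
            acc += size
            if u < acc:
                chosen = k
                break
        if chosen == len(sizes):
            sizes.append(0)
        sizes[chosen] += 1
        labels.append(chosen)
    return labels, sizes


if __name__ == "__main__":
    # Print the number of clusters and the largest cluster size for growing n.
    for n in (100, 1000, 10000):
        _, sizes = sample_crp_partition(n, theta=5.0, seed=1)
        print(n, len(sizes), max(sizes))
```

Under this particular prior the largest cluster tends to keep growing as more records are added, whereas in record linkage a single entity usually appears only a handful of times. That mismatch is one way to read the abstract's remark about the need to bound cluster sizes even for large data sets.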