A unified framework for de-duplication and population size estimation (with Discussion) / Tancredi, Andrea; Steorts, Rebecca; Liseo, Brunero. - In: BAYESIAN ANALYSIS. - ISSN 1936-0975. - 15:2(2020), pp. 633-658. [10.1214/19-BA1146]

A unified framework for de-duplication and population size estimation (with Discussion)

Andrea Tancredi;Rebecca Steorts;Brunero Liseo

2020

Abstract

Data de-duplication is the process of finding records in one or more datasets belonging to the same entity. In this paper we tackle the de-duplication process via a latent entity model, where the observed data are perturbed versions of a set of key variables drawn from a finite population of $N$ different entities. The main novelty of our approach is to consider the population size $N$ as an unknown model parameter. As a result, one salient feature of the proposed method is the capability of the model to account for the de-duplication uncertainty in the population size estimation. As by-products of our approach, we illustrate the relationships between de-duplication problems and capture-recapture models, and we obtain a more adequate prior distribution on the linkage structure. Moreover, we propose a novel simulation algorithm for the posterior distribution of the matching configuration based on the marginalization of the key variables at the population level. We apply our approach to two synthetic data sets comprising German names. In addition, we illustrate a real data application matching records from two lists reporting victims killed in the recent Syrian conflict.

Year of publication: 2020
Keywords: Cluster analysis; Entity resolution; Partition models; Record linkage
Type: 01 Journal publication::01a Journal article
Citation: A unified framework for de-duplication and population size estimation (with Discussion) / Tancredi, Andrea; Steorts, Rebecca; Liseo, Brunero. - In: BAYESIAN ANALYSIS. - ISSN 1936-0975. - 15:2(2020), pp. 633-658. [10.1214/19-BA1146]
Belongs to type: 01a Journal article

Files attached to this product

File	Access	Type	License	Size	Format
Tancredi_ unified-framework_2020.pdf	Open access	Publisher's version (published version with the publisher's layout)	All rights reserved	610.74 kB	Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1217559
