Generalized Bayesian Record Linkage and Regression with Exact Error Propagation / Steorts, Rebecca C.; Tancredi, Andrea; Liseo, Brunero. - PRINT. - (2018), pp. 297-313. [10.1007/978-3-319-99771-1_20].

Generalized Bayesian Record Linkage and Regression with Exact Error Propagation

Tancredi, Andrea; Liseo, Brunero
2018

Abstract

Record linkage (de-duplication or entity resolution) is the process of merging noisy databases to remove duplicate entities. While record linkage removes duplicate entities from such databases, the downstream task is any inferential, predictive, or post-linkage task on the linked data. One goal of the downstream task is to obtain a larger reference data set, allowing one to perform more accurate statistical analyses. In addition, there is inherent record linkage uncertainty that is passed to the downstream task. Motivated by the above, we propose a generalized Bayesian record linkage method and consider multiple regression analysis as the downstream task. Records are linked via a random partition model, which allows a wide class of models to be considered. In addition, we jointly model the record linkage and the downstream task, which allows one to account for the record linkage uncertainty exactly. Moreover, one is able to generate a feedback propagation mechanism of the information from the proposed Bayesian record linkage model into the downstream task. This feedback effect is essential to eliminate potential biases that can jeopardize the resulting downstream task. We apply our methodology to multiple linear regression, and illustrate empirically that the “feedback effect” is able to improve the performance of record linkage.
Privacy in Statistical Databases: UNESCO Chair in Data Privacy, International Conference, PSD 2018, Valencia, Spain, September 26–28, 2018, Proceedings
978-3-319-99770-4
978-3-319-99771-1
deduplication; entity resolution; linked data
02 Publication in a volume::02a Chapter or Article
Files attached to this item

Tancredi_Generalized-Bayesian-Record-Linkage_2018.pdf
Access: archive managers only
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 743.56 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1148805