Entity resolution (ER) seeks to identify which records in a data set refer to the same real-world entity. Given the diversity of ways in which entities can be represented, matched and distinguished, ER is known to be a challenging task for automated strategies, but relatively easy for expert humans. In our work, we abstract the knowledge of experts with the notion of a binary oracle. Our oracle can answer questions of the form "do records u and v refer to the same entity?" under a flexible error model, allowing for some questions to be more difficult to answer correctly than others. Our contribution is a general error correction tool that can be leveraged by a variety of hybrid human-machine ER algorithms, based on a formal way of selecting indirect "control queries". In our experiments we demonstrate that correction-less ER algorithms equipped with our tool can perform even better than recent ER algorithms specifically designed for correcting errors. Our control queries are selected among those that provide the strongest connectivity between records of each cluster, based on the concept of graph expanders (which are sparse graphs with formal connectivity properties). We give formal performance guarantees for our toolkit and provide experiments on real and synthetic data.

Robust entity resolution using random graphs / Galhotra, Sainyam; Firmani, Donatella; Saha, Barna; Srivastava, Divesh. - In: PROCEEDINGS - ACM-SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA. - ISSN 0730-8078. - (2018), pp. 3-18. ( SIGMOD 2018 (Rank A* conference, CORE2023) Houston, TX, USA ) [10.1145/3183713.3183755].

Robust entity resolution using random graphs

Firmani Donatella;
2018

Software; Information Systems
04 Publication in conference proceedings::04c Conference paper in journal

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1640576