Hierarchical Entity Resolution using an Oracle / Galhotra, Sainyam; Firmani, Donatella; Saha, Barna; Srivastava, Divesh. - In: PROCEEDINGS - ACM-SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA. - ISSN 0730-8078. - (2022), pp. 414-428. (Paper presented at the 48th ACM SIGMOD International Conference on Management of Data (SIGMOD), Class A++ (GII-GRIN rating), held in Philadelphia, PA, USA) [10.1145/3514221.3526147].

Hierarchical Entity Resolution using an Oracle

Donatella Firmani
2022

Abstract

In many applications, entity references (i.e., records) and entities need to be organized to capture diverse relationships like type-subtype, is-A (mapping entities to types), and duplicate (mapping records to entities) relationships. However, automatic identification of such relationships is often inaccurate due to noise and heterogeneous representation of records across sources. Similarly, manual maintenance of these relationships is infeasible and does not scale to large datasets. In this work, we circumvent these challenges by considering weak supervision in the form of an oracle to formulate a novel hierarchical ER task. In this setting, records are clustered in a tree-like structure containing records at leaf-level and capturing record-entity (duplicate), entity-type (is-A) and subtype-supertype relationships. For effective use of supervision, we leverage triplet comparison oracle queries that take three records as input and output the most similar pair(s). We develop HierER, a querying strategy that uses record pair similarities to minimize the number of oracle queries while maximizing the identified hierarchical structure. We show theoretically and empirically that HierER is effective under different similarity noise models and demonstrate empirically that HierER can scale up to million-size datasets.
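To make the triplet comparison query concrete, the following is a minimal sketch, not the HierER strategy itself. It assumes the oracle has access to a ground-truth hierarchy and judges a pair "most similar" when its lowest common ancestor is deepest in the tree; that reading of similarity, the `PARENT` map, the record and type names, and the function names are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical ground-truth hierarchy, stored as child -> parent links.
PARENT = {
    "r1": "iphone_12", "r2": "iphone_12",   # duplicate: two records, one entity
    "r3": "galaxy_s21",
    "r4": "thinkpad_x1",
    "r5": "pixel_6",
    "iphone_12": "phone", "galaxy_s21": "phone", "pixel_6": "phone",  # is-A
    "thinkpad_x1": "laptop",
    "phone": "electronics", "laptop": "electronics",  # subtype-supertype
}


def ancestors(node):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca_depth(a, b):
    """Depth (root = 0) of the lowest common ancestor of a and b."""
    on_b_path = set(ancestors(b))
    lca = next(n for n in ancestors(a) if n in on_b_path)
    return len(ancestors(lca)) - 1


def triplet_oracle(a, b, c):
    """Return the pair(s) among three records whose common ancestor is deepest."""
    pairs = list(combinations((a, b, c), 2))
    depths = [lca_depth(x, y) for x, y in pairs]
    best = max(depths)
    return [pair for pair, depth in zip(pairs, depths) if depth == best]


print(triplet_oracle("r1", "r2", "r3"))  # duplicates beat same-type records
print(triplet_oracle("r1", "r3", "r4"))  # same type beats different types
print(triplet_oracle("r1", "r3", "r5"))  # three-way tie: all pairs returned
```

In this sketch a single answer separates duplicates from records that merely share a type, and a tie returns all three pairs, which matches the "pair(s)" wording in the abstract. A real oracle would be a human or crowd worker rather than a lookup in a known tree.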
48th ACM SIGMOD International Conference on Management of Data (SIGMOD), Class A++ (GII-GRIN rating)
entity resolution, data integration, data cleaning
04 Publication in conference proceedings::04c Conference paper in journal
Files attached to this item

File: Galhotra_Hierarchical-entity-resolution_2022.pdf
Access: open access
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 4.83 MB
Format: Adobe PDF

File: Galhotra_Hierarchical-entity-resolution_copertina_indice_quarta_2022.pdf.pdf
Access: open access
Type: Other attached material
License: All rights reserved
Size: 1.98 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1640604