In many applications, entity references (i.e., records) and entities need to be organized to capture diverse relationships like type-subtype, is-A (mapping entities to types), and duplicate (mapping records to entities) relationships. However, automatic identification of such relationships is often inaccurate due to noise and heterogeneous representation of records across sources. Similarly, manual maintenance of these relationships is infeasible and does not scale to large datasets. In this work, we circumvent these challenges by considering weak supervision in the form of an oracle to formulate a novel hierarchical ER task. In this setting, records are clustered in a tree-like structure containing records at leaf-level and capturing record-entity (duplicate), entity-type (is-A) and subtype-supertype relationships. For effective use of supervision, we leverage triplet comparison oracle queries that take three records as input and output the most similar pair(s). We develop HierER, a querying strategy that uses record pair similarities to minimize the number of oracle queries while maximizing the identified hierarchical structure. We show theoretically and empirically that HierER is effective under different similarity noise models and demonstrate empirically that HierER can scale up to million-size datasets.
Hierarchical Entity Resolution using an Oracle / Galhotra, Sainyam; Firmani, Donatella; Saha, Barna; Srivastava, Divesh. - In: PROCEEDINGS - ACM-SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA. - ISSN 0730-8078. - (2022), pp. 414-428. (Intervento presentato al convegno 48th ACM SIGMOD International Conference on Management of Data (SIGMOD), Class A++ (GII-GRIN rating) tenutosi a Philadelphia, PA; USA) [10.1145/3514221.3526147].
Hierarchical Entity Resolution using an Oracle
Donatella Firmani;
2022
Abstract
In many applications, entity references (i.e., records) and entities need to be organized to capture diverse relationships like type-subtype, is-A (mapping entities to types), and duplicate (mapping records to entities) relationships. However, automatic identification of such relationships is often inaccurate due to noise and heterogeneous representation of records across sources. Similarly, manual maintenance of these relationships is infeasible and does not scale to large datasets. In this work, we circumvent these challenges by considering weak supervision in the form of an oracle to formulate a novel hierarchical ER task. In this setting, records are clustered in a tree-like structure containing records at leaf-level and capturing record-entity (duplicate), entity-type (is-A) and subtype-supertype relationships. For effective use of supervision, we leverage triplet comparison oracle queries that take three records as input and output the most similar pair(s). We develop HierER, a querying strategy that uses record pair similarities to minimize the number of oracle queries while maximizing the identified hierarchical structure. We show theoretically and empirically that HierER is effective under different similarity noise models and demonstrate empirically that HierER can scale up to million-size datasets.File | Dimensione | Formato | |
---|---|---|---|
Galhotra_Hierarchical-entity-resolution_2022.pdf
accesso aperto
Tipologia:
Versione editoriale (versione pubblicata con il layout dell'editore)
Licenza:
Tutti i diritti riservati (All rights reserved)
Dimensione
4.83 MB
Formato
Adobe PDF
|
4.83 MB | Adobe PDF | |
Galhotra_Hierarchical-entity-resolution_copertina_indice_quarta_2022.pdf.pdf
accesso aperto
Tipologia:
Altro materiale allegato
Licenza:
Tutti i diritti riservati (All rights reserved)
Dimensione
1.98 MB
Formato
Adobe PDF
|
1.98 MB | Adobe PDF |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.