The organization of records referring to different entities into a taxonomy is crucial for capturing their relationships. Nevertheless, the automatic identification of such relationships is often inaccurate due to noise and heterogeneity of records across various sources. At the same time, manual maintenance of these relationships is impractical and does not scale. This study addresses these challenges by adopting a weak supervision strategy, in the form of an oracle, to solve a novel Hierarchical Entity Resolution task. Within our framework, records are organized into a tree-like structure that holds records at the bottom level and entities and categories at the higher levels. To make the most effective use of supervision, we employ a triplet comparison oracle, which takes three records as input and outputs the most similar pair(s). Finally, we introduce HierER, a querying strategy that uses record pair similarities to minimize the number of oracle queries while maximizing the identification of the hierarchical structure. Theoretical and empirical analyses demonstrate the effectiveness and efficiency of HierER on noisy datasets with millions of records.
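To illustrate the triplet comparison oracle described above, the following is a minimal sketch that simulates such an oracle from a known hierarchy. It is not the HierER implementation: the `PATHS` table, the record names, and the use of lowest-common-ancestor depth as the notion of "most similar" are all assumptions made for this example.

```python
from itertools import combinations

# Hypothetical ground truth (not from the paper): each record maps to its
# path from the root of the hierarchy, here (category, entity).
PATHS = {
    "r1": ("cameras", "canon_eos_5d"),
    "r2": ("cameras", "canon_eos_5d"),
    "r3": ("cameras", "nikon_d750"),
    "r4": ("phones", "pixel_8"),
    "r5": ("cameras", "sony_a7"),
}


def lca_depth(a, b):
    """Depth of the lowest common ancestor of two records (0 = only the root)."""
    depth = 0
    for x, y in zip(PATHS[a], PATHS[b]):
        if x != y:
            break
        depth += 1
    return depth


def triplet_oracle(a, b, c):
    """Return the pair(s) among three records that share the deepest ancestor.

    Assumption: "most similar" means "deepest lowest common ancestor".
    Ties are all returned, mirroring the "pair(s)" wording of the abstract.
    """
    pairs = list(combinations((a, b, c), 2))
    depths = [lca_depth(x, y) for x, y in pairs]
    best = max(depths)
    return [pair for pair, d in zip(pairs, depths) if d == best]
```

In this sketch, a query on two records of the same entity plus a third record returns only the same-entity pair, while a query on three records from different entities of one category returns all three pairs as a tie.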

Building Taxonomies with Triplet Queries / Firmani, D.; Galhotra, S.; Saha, B.; Srivastava, D. - 3741:(2024), pp. 14-24. (32nd Symposium of Advanced Database Systems, Villasimius, Italy).

Building Taxonomies with Triplet Queries

Firmani D.; Srivastava D.
2024

32nd Symposium of Advanced Database Systems
Taxonomy; Database; Data integration
04 Publication in conference proceedings::04b Conference paper in volume
Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1739420