The problem of estimating the size of a query result has a long history in data management. When the query performs entity resolution (aka record linkage or deduplication), the problem is that of estimating the number of distinct entities, referred to as the entity count. This problem has received attention from the statistics community but it has been largely overlooked in the data management literature. In this work, we formally define the entity count problem from a data management perspective and decompose it into a framework of fundamental steps. We explore approaches from both statistics and data management, systematically identifying a design space for different pipelines that address this problem. Finally, we provide extensive experiments to highlight the strengths and weaknesses of these approaches on real-world benchmarks.

Evaluating Methods for Efficient Entity Count Estimation / Mathew, Jerin George; Firmani, Donatella; Srivastava, Divesh. - In: PROCEEDINGS OF THE VLDB ENDOWMENT. - ISSN 2150-8097. - 18:8(2025), pp. 2589-2601. [10.14778/3742728.3742750]

Evaluating Methods for Efficient Entity Count Estimation

Mathew, Jerin George; Firmani, Donatella; Srivastava, Divesh
2025

data cleaning; entity resolution; natural language
01 Journal publication::01a Journal article
Files attached to this item
File: Mathew_Evaluating-Methods_2025.pdf
Access: archive managers only
Note: DOI 10.14778/3742728.3742750
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 510.74 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1746808