Evaluating Methods for Efficient Entity Count Estimation / Mathew, Jerin George; Firmani, Donatella; Srivastava, Divesh. - In: PROCEEDINGS OF THE VLDB ENDOWMENT. - ISSN 2150-8097. - 18:8(2025), pp. 2589-2601. [10.14778/3742728.3742750]
Evaluating Methods for Efficient Entity Count Estimation
Mathew, Jerin George; Firmani, Donatella; Srivastava, Divesh
2025
Abstract
The problem of estimating the size of a query result has a long history in data management. When the query performs entity resolution (also known as record linkage or deduplication), the problem becomes that of estimating the number of distinct entities, referred to as the entity count. This problem has received attention from the statistics community, but it has been largely overlooked in the data management literature. In this work, we formally define the entity count problem from a data management perspective and decompose it into a framework of fundamental steps. We explore approaches from both statistics and data management, systematically identifying a design space for different pipelines that address this problem. Finally, we provide extensive experiments to highlight the strengths and weaknesses of these approaches on real-world benchmarks.

| File | Size | Format |
|---|---|---|
| Mathew_Evaluating-Methods_2025.pdf | 510.74 kB | Adobe PDF |

Access: archive managers only
Note: DOI 10.14778/3742728.3742750
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved


