Estimating the similarity of sets of data is a common operation in computing. Minhash is widely used to estimate similarity by computing a signature for each set and then comparing their signatures. Therefore, signature comparison is an important part of similarity estimation. To make the comparison efficient, the size of the signature components is commonly set to the word size of the processor or to one half or one fourth of it. This enables efficient data manipulation and comparison but is not optimal in terms of storage. For example, 48-bit signatures may be more than enough in many applications but since that size cannot be easily manipulated by most processors, 64-bit signatures are used. This implies a 33.3% memory overhead. In this paper, Bitwise Signature Comparison (BSC), a method that enables the efficient comparison of signature components of any bitwidth is presented and evaluated. The results show that BSC achieves a similar speed to that of the traditional comparison implementation regardless of the size of the signature components. This enables the use of any signature component size enabling better trade-offs in the implementation of similarity estimation sketches.

Bitwise Signature Comparison: Enabling more Efficient Similarity Estimation / Reviriego, P.; Pontarelli, S.; Martinez, J.. - In: IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING. - ISSN 2168-6750. - (2022), pp. 1-6. [10.1109/TETC.2022.3221872]

Bitwise Signature Comparison: Enabling more Efficient Similarity Estimation

Pontarelli S.;
2022

Abstract

Estimating the similarity of sets of data is a common operation in computing. Minhash is widely used to estimate similarity by computing a signature for each set and then comparing their signatures. Therefore, signature comparison is an important part of similarity estimation. To make the comparison efficient, the size of the signature components is commonly set to the word size of the processor or to one half or one fourth of it. This enables efficient data manipulation and comparison but is not optimal in terms of storage. For example, 48-bit signatures may be more than enough in many applications but since that size cannot be easily manipulated by most processors, 64-bit signatures are used. This implies a 33.3% memory overhead. In this paper, Bitwise Signature Comparison (BSC), a method that enables the efficient comparison of signature components of any bitwidth is presented and evaluated. The results show that BSC achieves a similar speed to that of the traditional comparison implementation regardless of the size of the signature components. This enables the use of any signature component size enabling better trade-offs in the implementation of similarity estimation sketches.
2022
Big Data; Estimation; Hash functions; Layout; Minhash; Optimization; Plagiarism; Probabilistic logic; Silicon; Similarity estimation
01 Pubblicazione su rivista::01a Articolo in rivista
Bitwise Signature Comparison: Enabling more Efficient Similarity Estimation / Reviriego, P.; Pontarelli, S.; Martinez, J.. - In: IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING. - ISSN 2168-6750. - (2022), pp. 1-6. [10.1109/TETC.2022.3221872]
File allegati a questo prodotto
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/1674466
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? 0
social impact