Estimating the similarity of sets of data is a common operation in computing. Minhash is widely used to estimate similarity by computing a signature for each set and then comparing their signatures. Therefore, signature comparison is an important part of similarity estimation. To make the comparison efficient, the size of the signature components is commonly set to the word size of the processor or to one half or one fourth of it. This enables efficient data manipulation and comparison but is not optimal in terms of storage. For example, 48-bit signatures may be more than enough in many applications but since that size cannot be easily manipulated by most processors, 64-bit signatures are used. This implies a 33.3% memory overhead. In this paper, Bitwise Signature Comparison (BSC), a method that enables the efficient comparison of signature components of any bitwidth is presented and evaluated. The results show that BSC achieves a similar speed to that of the traditional comparison implementation regardless of the size of the signature components. This enables the use of any signature component size enabling better trade-offs in the implementation of similarity estimation sketches.
Bitwise Signature Comparison: Enabling more Efficient Similarity Estimation / Reviriego, P.; Pontarelli, S.; Martinez, J.. - In: IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING. - ISSN 2168-6750. - (2022), pp. 1-6. [10.1109/TETC.2022.3221872]
Bitwise Signature Comparison: Enabling more Efficient Similarity Estimation
Pontarelli S.;
2022
Abstract
Estimating the similarity of sets of data is a common operation in computing. Minhash is widely used to estimate similarity by computing a signature for each set and then comparing their signatures. Therefore, signature comparison is an important part of similarity estimation. To make the comparison efficient, the size of the signature components is commonly set to the word size of the processor or to one half or one fourth of it. This enables efficient data manipulation and comparison but is not optimal in terms of storage. For example, 48-bit signatures may be more than enough in many applications but since that size cannot be easily manipulated by most processors, 64-bit signatures are used. This implies a 33.3% memory overhead. In this paper, Bitwise Signature Comparison (BSC), a method that enables the efficient comparison of signature components of any bitwidth is presented and evaluated. The results show that BSC achieves a similar speed to that of the traditional comparison implementation regardless of the size of the signature components. This enables the use of any signature component size enabling better trade-offs in the implementation of similarity estimation sketches.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.