Cardinality estimation, also known as count-distinct, is the problem of finding the number of different elements in a set with repeated elements. Among the many approximate algorithms proposed for this task, HyperLogLog (HLL) has established itself as the state of the art due to its ability to accurately estimate cardinality over a large range of values using a small memory footprint. When elements arrive in a stream, as in the case of most networking applications, improved techniques are possible. We specifically propose a new algorithm that improves the accuracy of cardinality estimation by grouping counters, and by using their new organization to further track all updates within a given counter size range (compared with just the last update as in the standard HLL). Results show that when using the same number of counters, one configuration of the new scheme reduces the relative error by approximately 0.86x using the same amount of memory as the streaming HLL and another configuration achieves a similar accuracy reducing the memory needed by approximately 0.85x.
More Accurate Streaming Cardinality Estimation With Vectorized Counters / Bruschi, Valerio; Reviriego, Pedro; Pontarelli, Salvatore; Ting, Daniel; Bianchi, Giuseppe. - In: IEEE NETWORKING LETTERS. - ISSN 2576-3156. - 3:2(2021), pp. 75-79. [10.1109/LNET.2021.3076048]
More Accurate Streaming Cardinality Estimation With Vectorized Counters
Pontarelli, Salvatore;
2021
Abstract
Cardinality estimation, also known as count-distinct, is the problem of finding the number of different elements in a set with repeated elements. Among the many approximate algorithms proposed for this task, HyperLogLog (HLL) has established itself as the state of the art due to its ability to accurately estimate cardinality over a large range of values using a small memory footprint. When elements arrive in a stream, as in the case of most networking applications, improved techniques are possible. We specifically propose a new algorithm that improves the accuracy of cardinality estimation by grouping counters, and by using their new organization to further track all updates within a given counter size range (compared with just the last update as in the standard HLL). Results show that when using the same number of counters, one configuration of the new scheme reduces the relative error by approximately 0.86x using the same amount of memory as the streaming HLL and another configuration achieves a similar accuracy reducing the memory needed by approximately 0.85x.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.