We investigate the problem of estimating on the fly the frequency at which items recur in large scale distributed data streams, which has become the norm in cloud-based application. This paper presents CASE, a combination of tools and probabilistic algorithms from the data streaming model. In this model, functions are estimated on a huge sequence of data items, in an online fashion, and with a very small amount of memory with respect to both the size of the input stream and the values domain from which data items are drawn. We derive upper and lower bounds on the quality of CASE, improving upon the Count-Min sketch algorithm which has, so far, been the best algorithm in terms of space and time performance to estimate the frequency of data items. We prove that CASE guarantees an (e, d)-approximation of the frequency of all the items, provided they are not rare, in a space- efficient way and for any input stream. Experiments on both synthetic and real datasets confirm our analysis.

Estimating the Frequency of Data Items in Massive Distributed Streams / Anceaume, Emmanuelle; Busnel, Yann; RIVETTI DI VAL CERVO, Nicolo'. - ELETTRONICO. - (2015), pp. 59-66. (Intervento presentato al convegno 4th IEEE Symposium on Network Cloud Computing and Applications, NCCA 2015 tenutosi a Munich; Germany nel 11-12 June 2015) [10.1109/NCCA.2015.19].

Estimating the Frequency of Data Items in Massive Distributed Streams

RIVETTI DI VAL CERVO, NICOLO'
2015

Abstract

We investigate the problem of estimating on the fly the frequency at which items recur in large scale distributed data streams, which has become the norm in cloud-based application. This paper presents CASE, a combination of tools and probabilistic algorithms from the data streaming model. In this model, functions are estimated on a huge sequence of data items, in an online fashion, and with a very small amount of memory with respect to both the size of the input stream and the values domain from which data items are drawn. We derive upper and lower bounds on the quality of CASE, improving upon the Count-Min sketch algorithm which has, so far, been the best algorithm in terms of space and time performance to estimate the frequency of data items. We prove that CASE guarantees an (e, d)-approximation of the frequency of all the items, provided they are not rare, in a space- efficient way and for any input stream. Experiments on both synthetic and real datasets confirm our analysis.
2015
4th IEEE Symposium on Network Cloud Computing and Applications, NCCA 2015
approximation theory;cloud computing;data handling;probability;CASE;approximation;cloud-based application;count-min sketch algorithm;data items;data streaming model;frequency estimation;large scale distributed data streams;massive distributed streams;probabilistic algorithms;values domain;Approximation algorithms;Approximation methods;Computational modeling;Computer aided software engineering;Data models;Estimation;Frequency estimation;Data stream model;frequency estimation;randomized approximation algorithm
04 Pubblicazione in atti di convegno::04b Atto di convegno in volume
Estimating the Frequency of Data Items in Massive Distributed Streams / Anceaume, Emmanuelle; Busnel, Yann; RIVETTI DI VAL CERVO, Nicolo'. - ELETTRONICO. - (2015), pp. 59-66. (Intervento presentato al convegno 4th IEEE Symposium on Network Cloud Computing and Applications, NCCA 2015 tenutosi a Munich; Germany nel 11-12 June 2015) [10.1109/NCCA.2015.19].
File allegati a questo prodotto
File Dimensione Formato  
Anceaume_Estimating_2015.pdf

solo gestori archivio

Tipologia: Versione editoriale (versione pubblicata con il layout dell'editore)
Licenza: Tutti i diritti riservati (All rights reserved)
Dimensione 636.09 kB
Formato Adobe PDF
636.09 kB Adobe PDF   Contatta l'autore

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/854244
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 3
  • ???jsp.display-item.citation.isi??? 1
social impact