We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of the SIGMOD Test of Time award winning algorithm BIRCH with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and there is limited storage space. BICO computes high quality solutions in a time short in practice. First, BICO computes a summary S of the data with a provable quality guarantee: For every center set C, S has the same cost as P up to a (1 + ε)-factor, i. e., S is a coreset. Then, it runs k-means++ on S. We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen) and with approximation algorithms (Stream-KM++, StreamLS) with the best known quality guarantees. We achieve the same quality as the approximation algorithms mentioned with a much shorter running time, and we get much better solutions than the heuristics at the cost of only a moderate increase in running time.

BICO: BIRCH meets coresets for k-means clustering / Fichtenberger, Hendrik; Gillé, Marc; Schmidt, Melanie; Schwiegelshohn, Chris; Sohler, Christian. - 8125:(2013), pp. 481-492. (Intervento presentato al convegno 21st Annual European Symposium on Algorithms, ESA 2013 tenutosi a Sophia Antipolis, fra nel 2013) [10.1007/978-3-642-40450-4_41].

BICO: BIRCH meets coresets for k-means clustering

Schwiegelshohn, Chris
;
2013

Abstract

We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of the SIGMOD Test of Time award winning algorithm BIRCH with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and there is limited storage space. BICO computes high quality solutions in a time short in practice. First, BICO computes a summary S of the data with a provable quality guarantee: For every center set C, S has the same cost as P up to a (1 + ε)-factor, i. e., S is a coreset. Then, it runs k-means++ on S. We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen) and with approximation algorithms (Stream-KM++, StreamLS) with the best known quality guarantees. We achieve the same quality as the approximation algorithms mentioned with a much shorter running time, and we get much better solutions than the heuristics at the cost of only a moderate increase in running time.
2013
21st Annual European Symposium on Algorithms, ESA 2013
Theoretical Computer Science, Computer Science (all)
04 Pubblicazione in atti di convegno::04b Atto di convegno in volume
BICO: BIRCH meets coresets for k-means clustering / Fichtenberger, Hendrik; Gillé, Marc; Schmidt, Melanie; Schwiegelshohn, Chris; Sohler, Christian. - 8125:(2013), pp. 481-492. (Intervento presentato al convegno 21st Annual European Symposium on Algorithms, ESA 2013 tenutosi a Sophia Antipolis, fra nel 2013) [10.1007/978-3-642-40450-4_41].
File allegati a questo prodotto
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/1085849
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 48
  • ???jsp.display-item.citation.isi??? ND
social impact