We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of the SIGMOD Test of Time award winning algorithm BIRCH with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and there is limited storage space. BICO computes high quality solutions in a time short in practice. First, BICO computes a summary S of the data with a provable quality guarantee: For every center set C, S has the same cost as P up to a (1 + ε)-factor, i. e., S is a coreset. Then, it runs k-means++ on S. We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen) and with approximation algorithms (Stream-KM++, StreamLS) with the best known quality guarantees. We achieve the same quality as the approximation algorithms mentioned with a much shorter running time, and we get much better solutions than the heuristics at the cost of only a moderate increase in running time.
BICO: BIRCH meets coresets for k-means clustering / Fichtenberger, Hendrik; Gillé, Marc; Schmidt, Melanie; Schwiegelshohn, Chris; Sohler, Christian. - 8125:(2013), pp. 481-492. (Intervento presentato al convegno 21st Annual European Symposium on Algorithms, ESA 2013 tenutosi a Sophia Antipolis, fra nel 2013) [10.1007/978-3-642-40450-4_41].
BICO: BIRCH meets coresets for k-means clustering
Schwiegelshohn, Chris
;
2013
Abstract
We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of the SIGMOD Test of Time award winning algorithm BIRCH with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and there is limited storage space. BICO computes high quality solutions in a time short in practice. First, BICO computes a summary S of the data with a provable quality guarantee: For every center set C, S has the same cost as P up to a (1 + ε)-factor, i. e., S is a coreset. Then, it runs k-means++ on S. We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen) and with approximation algorithms (Stream-KM++, StreamLS) with the best known quality guarantees. We achieve the same quality as the approximation algorithms mentioned with a much shorter running time, and we get much better solutions than the heuristics at the cost of only a moderate increase in running time.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.