BICO: BIRCH meets coresets for k-means clustering

Fichtenberger, Hendrik; Gillé, Marc; Schmidt, Melanie; Schwiegelshohn, Chris; Sohler, Christian

doi:10.1007/978-3-642-40450-4_41

We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of the SIGMOD Test of Time award winning algorithm BIRCH with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and there is limited storage space. BICO computes high quality solutions in a time short in practice. First, BICO computes a summary S of the data with a provable quality guarantee: For every center set C, S has the same cost as P up to a (1 + ε)-factor, i. e., S is a coreset. Then, it runs k-means++ on S. We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen) and with approximation algorithms (Stream-KM++, StreamLS) with the best known quality guarantees. We achieve the same quality as the approximation algorithms mentioned with a much shorter running time, and we get much better solutions than the heuristics at the cost of only a moderate increase in running time.

BICO: BIRCH meets coresets for k-means clustering / Fichtenberger, Hendrik; Gillé, Marc; Schmidt, Melanie; Schwiegelshohn, Chris; Sohler, Christian. - 8125:(2013), pp. 481-492. (Intervento presentato al convegno 21st Annual European Symposium on Algorithms, ESA 2013 tenutosi a Sophia Antipolis, fra nel 2013) [10.1007/978-3-642-40450-4_41].

BICO: BIRCH meets coresets for k-means clustering

Fichtenberger, Hendrik;Gillé, Marc;Schmidt, Melanie;Schwiegelshohn, Chris;Sohler, Christian

2013

Abstract

We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of the SIGMOD Test of Time award winning algorithm BIRCH with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and there is limited storage space. BICO computes high quality solutions in a time short in practice. First, BICO computes a summary S of the data with a provable quality guarantee: For every center set C, S has the same cost as P up to a (1 + ε)-factor, i. e., S is a coreset. Then, it runs k-means++ on S. We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen) and with approximation algorithms (Stream-KM++, StreamLS) with the best known quality guarantees. We achieve the same quality as the approximation algorithms mentioned with a much shorter running time, and we get much better solutions than the heuristics at the cost of only a moderate increase in running time.

Scheda breve

Scheda completa

	Anno di pubblicazione
	
				2013
			
	Nome convegno
	
				21st Annual European Symposium on Algorithms, ESA 2013
			
	Parole chiave
	
				Theoretical Computer Science, Computer Science (all)
			
	Tipologia
	
				04 Pubblicazione in atti di convegno::04b Atto di convegno in volume
			
	Citazione
	
				BICO: BIRCH meets coresets for k-means clustering / Fichtenberger, Hendrik; Gillé, Marc; Schmidt, Melanie; Schwiegelshohn, Chris; Sohler, Christian. - 8125:(2013), pp. 481-492. (Intervento presentato al  convegno 21st Annual European Symposium on Algorithms, ESA 2013 tenutosi a Sophia Antipolis, fra nel 2013) [10.1007/978-3-642-40450-4_41].
			
	Appartiene alla tipologia:
	
				04b Atto di convegno in volume

File allegati a questo prodotto

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/1085849

Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni

ND

49

ND

Catalogo dei prodotti della ricerca