Finding near neighbors through cluster pruning

Chierichetti, Flavio; Panconesi, Alessandro; Raghavan, Prabhakar; Sozio, Mauro; Tiberi, Alessandro; Upfal, Eli

doi:10.1145/1265530.1265545

Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications in text and image search,in finding similar objects and matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing we randomly choose a subset of data points to be leaders the remaining data points are partitioned by which leader is the closest. For query processing, we find the leader(s) closest to the query point. We then seek the nearest neighbors for the query point among only the points in the clusters of the closest leader(s). Recursion may be used in both preprocessing and in search. Such schemes seek approximate nearest neighbors that are "almost as good" as the nearest neighbors. How good are these approximations and how much do they save in computation. Our contributions are: (1) we quantify metrics that allow us to study the tradeoff between processing and the quality of the approximate nearest neighbors; (2) we give rigorous theoretical analysis of our schemes, under natural generative processes (generalizing Gaussian mixtures) for the data points; (3) experiments on both synthetic data from such generative processes, as well as on from a document corpus, confirming that we save orders of magnitude in query processing cost at modest compromises in the quality of retrieved points. In particular, we show that p-spheres, a state-of-the-art solution, is outperformed by our simple scheme whether the data points are stored in main or in external memory. Copyright 2007 ACM.

Finding near neighbors through cluster pruning / Chierichetti, Flavio; Panconesi, Alessandro; Prabhakar, Raghavan; Mauro, Sozio; Alessandro, Tiberi; Eli, Upfal. - (2007), pp. 103-112. (Intervento presentato al convegno 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007 tenutosi a Beijing nel 11 June 2007 through 13 June 2007) [10.1145/1265530.1265545].

Finding near neighbors through cluster pruning

CHIERICHETTI, FLAVIO;PANCONESI, Alessandro;Prabhakar Raghavan;Mauro Sozio;Alessandro Tiberi;Eli Upfal

2007

Abstract

Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications in text and image search,in finding similar objects and matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing we randomly choose a subset of data points to be leaders the remaining data points are partitioned by which leader is the closest. For query processing, we find the leader(s) closest to the query point. We then seek the nearest neighbors for the query point among only the points in the clusters of the closest leader(s). Recursion may be used in both preprocessing and in search. Such schemes seek approximate nearest neighbors that are "almost as good" as the nearest neighbors. How good are these approximations and how much do they save in computation. Our contributions are: (1) we quantify metrics that allow us to study the tradeoff between processing and the quality of the approximate nearest neighbors; (2) we give rigorous theoretical analysis of our schemes, under natural generative processes (generalizing Gaussian mixtures) for the data points; (3) experiments on both synthetic data from such generative processes, as well as on from a document corpus, confirming that we save orders of magnitude in query processing cost at modest compromises in the quality of retrieved points. In particular, we show that p-spheres, a state-of-the-art solution, is outperformed by our simple scheme whether the data points are stored in main or in external memory. Copyright 2007 ACM.

Scheda breve

Scheda completa

	Anno di pubblicazione
	
				2007
			
	Nome convegno
	
				26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007
			
	Parole chiave
	
				clustering; generative model; nearest neighbor
			
	Tipologia
	
				04 Pubblicazione in atti di convegno::04b Atto di convegno in volume
			
	Citazione
	
				Finding near neighbors through cluster pruning / Chierichetti, Flavio; Panconesi, Alessandro; Prabhakar, Raghavan; Mauro, Sozio; Alessandro, Tiberi; Eli, Upfal. - (2007), pp. 103-112. (Intervento presentato al  convegno 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007 tenutosi a Beijing nel 11 June 2007 through 13 June 2007) [10.1145/1265530.1265545].
			
	Appartiene alla tipologia:
	
				04b Atto di convegno in volume

File allegati a questo prodotto

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/358429

Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni

ND

28

ND

Catalogo dei prodotti della ricerca