Sampling Search-Engine Results

ANAGNOSTOPOULOS, ARISTIDIS; A. Z. BRODER; D. CARMEL

2005

Abstract

We consider the problem of efficiently sampling Web search engine query results. Using a small random sample instead of the full set of results leads, in turn, to efficient approximate algorithms for several applications, such as:
• Determining the set of categories in a given taxonomy spanned by the search results;
• Finding the range of metadata values associated with the result set in order to enable "multi-faceted search";
• Estimating the size of the result set;
• Data mining associations to the query terms.

We present and analyze an efficient algorithm for obtaining uniform random samples, applicable to any search engine based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, e.g. Google, Inktomi, AltaVista, AllTheWeb, belong to this class.) Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case we show how to construct a basic next(p) method that samples term posting lists with probability p, and how to construct next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods. Finally, we test the efficiency and quality of our approach on both synthetic and real-world data.
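To make the next(p) idea in the abstract concrete, here is a minimal Python sketch. It is not the authors' algorithm. The class names TermStream and AndStream, the seek helper, and the sample function are hypothetical names introduced here, and the geometric-skip trick is one common way to sample each posting independently with probability p, assumed for illustration. Only a two-term AND is shown; OR and WAND are not covered.

```python
import bisect
import math
import random


class TermStream:
    """Sorted posting list viewed as a stream (hypothetical helper class)."""

    def __init__(self, doc_ids, rng=None):
        self.doc_ids = sorted(doc_ids)
        self.pos = 0  # index of the next unread posting
        self.rng = rng or random.Random()

    def seek(self, doc_id):
        """Move to the first posting >= doc_id and peek at it (None if exhausted)."""
        self.pos = bisect.bisect_left(self.doc_ids, doc_id, self.pos)
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

    def next(self, p=1.0):
        """Return the next posting, keeping each posting independently with probability p."""
        if not 0.0 < p <= 1.0:
            raise ValueError("p must be in (0, 1]")
        if p < 1.0:
            # Number of rejected postings before the next accepted one is
            # geometric, so jump over them instead of flipping one coin each.
            u = 1.0 - self.rng.random()  # uniform in (0, 1]
            self.pos += int(math.log(u) / math.log(1.0 - p))
        if self.pos >= len(self.doc_ids):
            self.pos = len(self.doc_ids)
            return None
        doc = self.doc_ids[self.pos]
        self.pos += 1
        return doc


class AndStream:
    """AND of two term streams: sample the first, keep hits also present in the second."""

    def __init__(self, first, second):
        self.first, self.second = first, second

    def next(self, p=1.0):
        while True:
            doc = self.first.next(p)
            if doc is None:
                return None
            if self.second.seek(doc) == doc:
                return doc


def sample(stream, p):
    """Drain a stream with next(p) and collect the sampled doc ids."""
    out = []
    doc = stream.next(p)
    while doc is not None:
        out.append(doc)
        doc = stream.next(p)
    return out
```

Under these assumptions, every document in the intersection is kept with probability p, so len(sample(AndStream(a, b), p)) / p gives a rough estimate of the result-set size, which is one of the applications the abstract lists.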

Publication year: 2005
Conference name: Proceedings of the 14th international conference on World Wide Web (WWW 2005)
Keywords: Sampling; Search engines; WAND
Type: 04 Publication in conference proceedings::04b Conference paper in volume
Citation: Sampling Search-Engine Results / Anagnostopoulos, A.; Broder, A. Z.; Carmel, D. - (2005), pp. 245-256. (Proceedings of the 14th international conference on World Wide Web (WWW 2005), Chiba, Japan) [10.1145/1060745.1060784].

Files attached to this record

File	Access	Type	License	Size	Format
VE_2005_11573-332444.pdf	Archive managers only	Publisher's version (published version with the publisher's layout)	All rights reserved	254.71 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/332444

Warning: the data shown have not been validated by the university.
