We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as: - Determining the set of categories in a given taxonomy spanned by the search results; - Finding the range of metadata values associated with the result set in order to enable “multi-faceted search”; - Estimating the size of the result set; - Data mining associations to the query terms. We present and analyze efficient algorithms for obtaining uniform random samples applicable to any search engine that is based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, for example, Google, Yahoo Search, MSN Search, Ask, belong to this class.) Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case we show how to construct a basic sample-next(p) method that samples term posting lists with probability p, and show how to construct sample-next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods. Finally, we test the efficiency and quality of our approach on both synthetic and real-world data.

Sampling Search-Engine Results / Anagnostopoulos, Aristidis; A. Z., Broder; D., Carmel. - In: WORLD WIDE WEB. - ISSN 1386-145X. - 9:(2006), pp. 397-429. [10.1007/s11280-006-0222-z]

Sampling Search-Engine Results

ANAGNOSTOPOULOS, ARISTIDIS;
2006

Abstract

We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as: - Determining the set of categories in a given taxonomy spanned by the search results; - Finding the range of metadata values associated with the result set in order to enable “multi-faceted search”; - Estimating the size of the result set; - Data mining associations to the query terms. We present and analyze efficient algorithms for obtaining uniform random samples applicable to any search engine that is based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, for example, Google, Yahoo Search, MSN Search, Ask, belong to this class.) Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case we show how to construct a basic sample-next(p) method that samples term posting lists with probability p, and show how to construct sample-next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods. Finally, we test the efficiency and quality of our approach on both synthetic and real-world data.
2006
01 Pubblicazione su rivista::01a Articolo in rivista
Sampling Search-Engine Results / Anagnostopoulos, Aristidis; A. Z., Broder; D., Carmel. - In: WORLD WIDE WEB. - ISSN 1386-145X. - 9:(2006), pp. 397-429. [10.1007/s11280-006-0222-z]
File allegati a questo prodotto
File Dimensione Formato  
VE_2006_11573-337348.pdf

solo gestori archivio

Tipologia: Versione editoriale (versione pubblicata con il layout dell'editore)
Licenza: Tutti i diritti riservati (All rights reserved)
Dimensione 740.59 kB
Formato Adobe PDF
740.59 kB Adobe PDF   Contatta l'autore

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/337348
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 22
  • ???jsp.display-item.citation.isi??? 16
social impact