In this paper we address the problem of estimating the index size needed by web search engines to answer as many queries as possible by exploiting the marked difference between query and click frequencies. We provide a possible formal definition for the notion of essential web pages as those that cover a large fraction of distinct queries --- i.e., we look at the problem as a version of MaxCover. Although in general MaxCover is approximable to within a factor of 1-1/e ~0.632 from the optimum, we provide a condition under which the greedy algorithm does find the actual best cover (or remains at a known bounded factor from it). The extra check for optimality (or for bounding the ratio from the optimum) comes at a negligible algorithmic cost. Moreover, in most practical instances of this problem, the algorithm is able to provide solutions that are provably optimal, or close to optimal. We relate this observed phenomenon to some properties of the queries' click graph. Our experimental results confirm that a small number of web pages can respond to a large fraction of the queries (e.g., 0.4% of the pages answers 20% of the queries). Our approach can be used in several related search applications, and has in fact an even more general appeal --- as a first example, our preliminary experimental study confirms that our algorithm has extremely good performances on other (social network based) MaxCover instances

Essential web pages are easy to find / Baeza Yates, Ricardo; Boldi, Paolo; Chierichetti, Flavio. - (2015), pp. 97-107. (Intervento presentato al convegno 24th International Conference on World Wide Web, WWW 2015 tenutosi a Firenze nel 2015) [10.1145/2736277.2741100].

Essential web pages are easy to find

CHIERICHETTI, FLAVIO
2015

Abstract

In this paper we address the problem of estimating the index size needed by web search engines to answer as many queries as possible by exploiting the marked difference between query and click frequencies. We provide a possible formal definition for the notion of essential web pages as those that cover a large fraction of distinct queries --- i.e., we look at the problem as a version of MaxCover. Although in general MaxCover is approximable to within a factor of 1-1/e ~0.632 from the optimum, we provide a condition under which the greedy algorithm does find the actual best cover (or remains at a known bounded factor from it). The extra check for optimality (or for bounding the ratio from the optimum) comes at a negligible algorithmic cost. Moreover, in most practical instances of this problem, the algorithm is able to provide solutions that are provably optimal, or close to optimal. We relate this observed phenomenon to some properties of the queries' click graph. Our experimental results confirm that a small number of web pages can respond to a large fraction of the queries (e.g., 0.4% of the pages answers 20% of the queries). Our approach can be used in several related search applications, and has in fact an even more general appeal --- as a first example, our preliminary experimental study confirms that our algorithm has extremely good performances on other (social network based) MaxCover instances
2015
24th International Conference on World Wide Web, WWW 2015
Approximation algorithms; Click graph; Greedy algorithms.; Layered indices; Max cover; Query log analysis; Tiering; Web search; Computer Networks and Communications; Software
04 Pubblicazione in atti di convegno::04b Atto di convegno in volume
Essential web pages are easy to find / Baeza Yates, Ricardo; Boldi, Paolo; Chierichetti, Flavio. - (2015), pp. 97-107. (Intervento presentato al convegno 24th International Conference on World Wide Web, WWW 2015 tenutosi a Firenze nel 2015) [10.1145/2736277.2741100].
File allegati a questo prodotto
File Dimensione Formato  
Chierichetti_Essential_2015.pdf

solo gestori archivio

Tipologia: Versione editoriale (versione pubblicata con il layout dell'editore)
Licenza: Tutti i diritti riservati (All rights reserved)
Dimensione 713.45 kB
Formato Adobe PDF
713.45 kB Adobe PDF   Contatta l'autore

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/877846
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 4
  • ???jsp.display-item.citation.isi??? 2
social impact