Catalogo dei prodotti della ricerca

Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how well compactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions. Copyright is held by the International World Wide Web Conference Committee (IW3C2).

Compressed Web indexes / Chierichetti, Flavio; Ravi, Kumar; Raghavan, Prabhakar. - (2009), pp. 451-460. ( 18th International World Wide Web Conference, WWW 2009 Madrid 20 April 2009 through 24 April 2009) [10.1145/1526709.1526770].

Compressed Web indexes

CHIERICHETTI, FLAVIO;Ravi Kumar;Raghavan Prabhakar

2009

Abstract

Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how well compactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions. Copyright is held by the International World Wide Web Conference Committee (IW3C2).

Scheda breve

Scheda completa

	Anno di pubblicazione
	
				2009
			
	Nome convegno
	
				18th International World Wide Web Conference, WWW 2009
			
	Parole chiave
	
				compression; double-pareto; index size; power law
			
	Tipologia
	
				04 Pubblicazione in atti di convegno::04b Atto di convegno in volume
			
	Citazione
	
				Compressed Web indexes / Chierichetti, Flavio; Ravi, Kumar; Raghavan, Prabhakar. - (2009), pp. 451-460. ( 18th International World Wide Web Conference, WWW 2009 Madrid 20 April 2009 through 24 April 2009) [10.1145/1526709.1526770].
			
	Appartiene alla tipologia:
	
				04b Atto di convegno in volume

File allegati a questo prodotto

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/497780

Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni

ND

17

ND

social impact