Effective and efficient classification on a search-engine model

Anagnostopoulos, Aristidis; Broder, Andrei; Punera, Kunal

doi:10.1007/s10115-007-0102-6

Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, and thus our access to documents is solely through the inverted index of a large-scale search engine. Our main goal is to build the "best" short query that characterizes a document class using operators normally available within search engines. We show that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms. As part of our study, we enhance some of the feature-selection techniques that are found in the literature by forcing the inclusion of terms that are negatively correlated with the target class and by making use of term correlations; we show that both of those techniques can offer significant advantages. Moreover, we show that optimizing the efficiency of query execution by careful selection of terms can further reduce the query costs. More precisely, we show that in our setup the best 10-term query can achieve 93% of the accuracy of the best SVM classifier (14,000 terms), and if we are willing to tolerate a reduction to 89% of the best SVM, we can build a 10-term query that can be executed more than twice as fast as the best 10-term query. © Springer-Verlag London Limited 2007.
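
The idea of a short weighted query that stands in for a classifier can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not name the feature-selection measures or the query operators used, so the smoothed log-odds weight, the function names (`term_weights`, `build_query`, `matches`), the parameters, and the toy documents below are all assumptions chosen for illustration. The sketch only shows the general shape of the approach, namely picking a few high-weight terms, reserving some slots for negatively correlated terms, and accepting a document when the summed weight of its matching terms exceeds a threshold.

```python
import math
from collections import Counter


def term_weights(docs, labels, smoothing=1.0):
    """Smoothed log-odds of a term occurring in positive vs. negative documents.

    Assumed scoring function for illustration; positive values indicate terms
    correlated with the target class, negative values the opposite.
    """
    pos_df, neg_df = Counter(), Counter()
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for terms, label in zip(docs, labels):
        (pos_df if label else neg_df).update(set(terms))
    weights = {}
    for term in set(pos_df) | set(neg_df):
        p = (pos_df[term] + smoothing) / (n_pos + 2 * smoothing)
        q = (neg_df[term] + smoothing) / (n_neg + 2 * smoothing)
        weights[term] = math.log(p / q)
    return weights


def build_query(weights, k=10, n_negative=2):
    """Pick at most k terms, reserving up to n_negative slots for negative terms."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1])  # most negative first
    negatives = [kv for kv in ranked[:n_negative] if kv[1] < 0]
    positives = [kv for kv in reversed(ranked) if kv[1] > 0][: k - len(negatives)]
    return dict(positives + negatives)


def matches(query, terms, threshold=0.0):
    """Accept a document when the summed weight of its query terms exceeds threshold."""
    present = set(terms)
    score = sum(w for t, w in query.items() if t in present)
    return score > threshold


# Toy training data (hypothetical): label 1 marks the target class.
docs = [
    ["python", "code", "bug"],
    ["python", "library", "code"],
    ["football", "goal", "match"],
    ["goal", "league", "code"],
]
labels = [1, 1, 0, 0]

query = build_query(term_weights(docs, labels), k=3, n_negative=1)
print(query)
print(matches(query, ["python", "tutorial"]))
print(matches(query, ["goal", "score"]))
```

In a real search engine the query would be evaluated against the inverted index rather than document by document; the per-document `matches` function here only makes the scoring rule explicit.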

Effective and efficient classification on a search-engine model / Anagnostopoulos, A., Broder, A., Punera, K. - In: KNOWLEDGE AND INFORMATION SYSTEMS. - ISSN 0219-1377. - 16:2(2008), pp. 129-154. [10.1007/s10115-007-0102-6]

Year of publication: 2008
Keywords: feature selection; query efficiency; search engine; term correlations; text classification; wand
Type: 01 Journal publication::01a Journal article


Use this identifier to cite or link to this document: https://hdl.handle.net/11573/337347
