Effective and efficient classification on a search-engine model / Anagnostopoulos, Aristidis; Broder, Andrei Z.; Punera, Kunal. - (2006), pp. 208-217. (Paper presented at the 15th ACM Conference on Information and Knowledge Management, CIKM 2006, held in Arlington, United States, 6-11 November 2006) [10.1145/1183614.1183648].

Effective and efficient classification on a search-engine model

ANAGNOSTOPOULOS, ARISTIDIS;
2006

Abstract

Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, and thus our access to documents is solely through the inverted index of a large-scale search engine. Our main goal is to build the "best" short query that characterizes a document class using operators normally available within large engines. We show that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms. Moreover, we show that optimizing the efficiency of query execution by careful selection of these terms can further reduce the query costs. More precisely, we show that in our setup the best 10-term query can achieve 90% of the accuracy of the best SVM classifier (14000 terms), and if we are willing to tolerate a reduction to 86% of the best SVM, we can build a 10-term query that can be executed more than twice as fast as the best 10-term query. Copyright 2006 ACM.
15th ACM Conference on Information and Knowledge Management, CIKM 2006
feature selection; query efficiency; search engine; text classification; wand
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this record

File: VE_2006_11573-332443.pdf
Access: archive managers only
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 337.26 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/332443
Warning: the data displayed have not been validated by the university.

Citations
  • Scopus 11