Automatic text categorization by a Granular Computing approach: Facing unbalanced data sets / POSSEMATO, FRANCESCA; RIZZI, Antonello. - (2013), pp. 1-8. (Paper presented at the 2013 International Joint Conference on Neural Networks, IJCNN 2013, held in Dallas, United States, 4 August 2013 through 9 August 2013) [10.1109/ijcnn.2013.6707082].

Automatic text categorization by a Granular Computing approach: Facing unbalanced data sets

POSSEMATO, FRANCESCA; RIZZI, Antonello
2013

Abstract

Text categorization is an application of machine learning with a wide range of uses, from document management systems to web mining. Designing such a system requires correctly defining both a suitable preprocessing procedure and an effective document representation, one related as closely as possible to the semantic nature of the document categories. To this aim, relying on a Granular Computing approach and considering a document as an ordered sequence of words, we propose a system able to automatically mine frequent terms, where a term is not only a single word but also a subsequence of (a few) consecutive words. The whole classification system is tailored to process sequences of atomic elements (i.e., encoded words) by means of an embedding procedure based on clustering methods. However, when dealing with unbalanced data sets, i.e., when classes are not evenly represented in the data set, the frequent substructures search procedure must be caref
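As a rough illustration of the term-mining idea described in the abstract, the sketch below counts every run of up to `max_len` consecutive words and keeps those that occur in a sufficient share of documents. It is not the authors' procedure: the function names, the `min_support` threshold, and in particular the choice to measure support relative to each class's own size (one plausible way to keep a small class from being drowned out) are all assumptions made for this example.

```python
from collections import Counter


def ngrams(words, max_len):
    """Yield every run of 1..max_len consecutive words as a tuple."""
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def frequent_terms(docs, labels, max_len=3, min_support=0.5):
    """Per class, return the n-grams found in at least `min_support`
    of that class's documents.

    Assumption (not from the paper): support is computed relative to
    the size of each class rather than the size of the whole data set.
    """
    class_sizes = Counter(labels)
    doc_freq = {c: Counter() for c in class_sizes}
    for words, label in zip(docs, labels):
        # set(...) so each term is counted at most once per document
        doc_freq[label].update(set(ngrams(words, max_len)))
    return {
        c: {t for t, k in doc_freq[c].items() if k / class_sizes[c] >= min_support}
        for c in class_sizes
    }


# Toy, deliberately unbalanced data set (hypothetical).
docs = [
    "neural networks learn features".split(),
    "neural networks need data".split(),
    "deep neural networks".split(),
    "stock market falls".split(),
]
labels = ["ml", "ml", "ml", "finance"]
terms = frequent_terms(docs, labels, max_len=2, min_support=0.6)
```

With a single global count, the lone "finance" document would contribute no frequent terms at all under the same threshold; the per-class ratio used here is only one simple way to avoid that.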
2013 International Joint Conference on Neural Networks, IJCNN 2013
frequent substructures mining; unbalanced data sets; granular computing; text categorization
Conference proceedings publication::04b Conference paper in volume

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/526118