Automatic text categorization by a Granular Computing approach: Facing unbalanced data sets / POSSEMATO, FRANCESCA; RIZZI, Antonello. - (2013), pp. 1-8. (Paper presented at the 2013 International Joint Conference on Neural Networks, IJCNN 2013, held in Dallas, United States, from 4 August 2013 through 9 August 2013) [10.1109/ijcnn.2013.6707082].
Automatic text categorization by a Granular Computing approach: Facing unbalanced data sets
POSSEMATO, FRANCESCA; RIZZI, Antonello
2013
Abstract
Text categorization is an application of machine learning with a wide range of uses, from document management systems to web mining. In designing such a system it is essential to define both a suitable preprocessing procedure and an effective document representation that reflects, as closely as possible, the semantic nature of the document categories. To this aim, relying on a Granular Computing approach and considering a document as an ordered sequence of words, we propose a system able to automatically mine frequent terms, where a term may be not only a single word but also a subsequence of (a few) consecutive words. The whole classification system is tailored to process sequences of atomic elements (i.e., encoded words) by means of an embedding procedure based on clustering methods. However, when dealing with unbalanced data sets, i.e., when classes are not evenly represented in the data set, the frequent substructures search procedure must be caref[...]
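The notion of a "term" as either a single word or a short run of consecutive words can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract describes a Granular Computing approach with a clustering-based embedding, whereas the sketch below simply counts exact word n-grams by document frequency. The function name `mine_frequent_terms` and the parameters `max_len` and `min_support` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def mine_frequent_terms(documents, max_len=3, min_support=2):
    """Return word n-grams (1..max_len consecutive words) that occur in
    at least `min_support` documents, mapped to their document frequency.

    Illustrative assumption: terms are matched exactly after lowercasing
    and whitespace splitting; no clustering or embedding is performed.
    """
    doc_freq = Counter()
    for doc in documents:
        words = doc.lower().split()
        # Collect each distinct n-gram once per document.
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                seen.add(tuple(words[i:i + n]))
        doc_freq.update(seen)
    return {term: count for term, count in doc_freq.items()
            if count >= min_support}


# Toy corpus, made up for this example.
docs = [
    "neural networks learn text categories",
    "neural networks classify documents",
    "granular computing for text mining",
]
terms = mine_frequent_terms(docs, max_len=2, min_support=2)
for term, count in sorted(terms.items()):
    print(" ".join(term), count)
```

With a single global threshold such as `min_support`, terms typical of a sparsely represented class can fall below the cutoff simply because that class contributes few documents, which is one way to see why unbalanced data sets call for extra care in the frequent-term search.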