
Mining m-grams by a granular computing approach for text classification / Capillo, A., De Santis, E., Frattale Mascioli, F.M., Rizzi, A. - (2020), pp. 350-360. (12th International Joint Conference on Computational Intelligence - NCTA Online Streaming) [10.5220/0010109803500360].

Mining m-grams by a granular computing approach for text classification

Antonino Capillo; Enrico de Santis; Fabio Massimo Frattale Mascioli; Antonello Rizzi

2020

Abstract

Text mining and text classification are gaining importance in AI-related research fields. Researchers are particularly focused on classification systems based on structured data (such as sequences or graphs), which face the challenge of synthesizing interpretable models by exploiting gray-box approaches. In this paper, a novel gray-box text classifier is presented. Documents to be classified are split into their constituent words, or tokens. Frequent groups of m tokens (or m-grams) are mined within the Granular Computing framework. Using the fastText algorithm, each token is encoded as a real-valued vector, and a custom dissimilarity measure, grounded in the Edit family, is designed specifically to deal with m-grams. Through a clustering procedure, the most representative m-grams of the document corpus are extracted and arranged into a Symbolic Histogram representation. The latter allows documents to be embedded in a well-suited real-valued space in which a standard classifier, such as an SVM, can safely operate. Along with the classification procedure, an Evolutionary Algorithm performs feature selection, choosing the most relevant symbols (m-grams) for each class. This study shows how symbols can be fruitfully interpreted, enabling a knowledge discovery procedure in line with the requirements of modern explainable AI systems. The effectiveness of the proposed algorithm is demonstrated through a set of experiments on paper abstract classification and SMS spam detection.
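
The embedding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the dissimilarity measure, the clustering procedure, or the evolutionary feature selection, so everything here is an assumption made for illustration. In particular, the hand-written 2-D vectors in `VECTORS` stand in for fastText embeddings, the dissimilarity is a plain Levenshtein-style recurrence with an embedding-distance substitution cost, the symbols in `SYMBOLS` are hand-picked rather than obtained by clustering, and the threshold-based matching rule is hypothetical.

```python
import math

# Toy 2-D token vectors (assumption: placeholders for fastText embeddings).
VECTORS = {
    "free": (1.0, 0.0), "prize": (0.9, 0.1), "win": (1.0, 0.2),
    "meeting": (0.0, 1.0), "tomorrow": (0.1, 0.9), "a": (0.5, 0.5),
}

# Hand-picked symbols (assumption: in the paper these come from clustering).
SYMBOLS = [("free", "prize"), ("meeting", "tomorrow")]


def tokenize(text):
    """Split a document into lowercase whitespace-separated tokens."""
    return text.lower().split()


def mgrams(tokens, m):
    """Return all contiguous groups of m tokens, in order of appearance."""
    return [tuple(tokens[i:i + m]) for i in range(len(tokens) - m + 1)]


def mgram_dissimilarity(a, b, vectors, indel=1.0):
    """Edit-style dissimilarity between two m-grams.

    Substituting one token for another costs the Euclidean distance
    between their vectors; inserting or deleting a token costs `indel`.
    Every token must have an entry in `vectors`.
    """
    n, k = len(a), len(b)
    d = [[0.0] * (k + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, k + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            sub = math.dist(vectors[a[i - 1]], vectors[b[j - 1]])
            d[i][j] = min(
                d[i - 1][j] + indel,      # delete a token from a
                d[i][j - 1] + indel,      # insert a token from b
                d[i - 1][j - 1] + sub,    # substitute one token
            )
    return d[n][k]


def symbolic_histogram(tokens, symbols, vectors, m, threshold):
    """Count, for each symbol, the document m-grams within `threshold` of it."""
    hist = [0] * len(symbols)
    for gram in mgrams(tokens, m):
        for idx, symbol in enumerate(symbols):
            if mgram_dissimilarity(gram, symbol, vectors) <= threshold:
                hist[idx] += 1
    return hist


spam_like = symbolic_histogram(tokenize("win a free prize"), SYMBOLS, VECTORS, 2, 0.3)
print(spam_like)  # [1, 0]
```

Each document becomes a fixed-length integer vector with one component per symbol, which is the kind of real-valued representation a standard classifier such as an SVM can take as input. Because each component corresponds to a readable m-gram, a feature selection step that keeps only some components also yields a list of m-grams that can be inspected directly.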

Full record

Publication year	2020
Conference name	12th International Joint Conference on Computational Intelligence - NCTA
Keywords	text mining; text categorization; granular computing; knowledge discovery; explainable AI
Type	04 Publication in conference proceedings::04b Conference paper in volume

Files attached to this product

File	Access	Type	License	Size	Format
Capillo_Mining_2020.pdf	Archive managers only	Publisher's version (published version with the publisher's layout)	All rights reserved	392.66 kB	Adobe PDF
Flyer_IJCCI_SS_2020.pdf	Open access	Other attached material	All rights reserved	3.03 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1461477
