Research products catalogue

Website categorization has recently emerged as an important task in several contexts. A huge amount of information is freely available through websites, and it could be used, for example, to carry out statistical surveys at reduced cost. However, the information relevant to a specific categorization has to be mined from that huge amount, which turns out to be a difficult task in practice. In this work we propose a practically viable procedure to perform website categorization, based on the automatic generation of data records summarizing the content of each entire website. This is obtained by using web scraping and optical character recognition, followed by a number of nontrivial text mining and feature engineering steps. Once such records have been produced, we use classification algorithms to categorize the websites according to the aspect of interest. We compare Convolutional Neural Networks, Support Vector Machines, Random Forest and Logistic classifiers on this task. Since in many practical cases the training set labels are inherently noisy, we analyze the robustness of each technique with respect to the presence of misclassified training records. We present results on real-world data for the problem of detecting websites that provide e-commerce facilities; however, our approach is not structurally limited to this case.
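The robustness analysis mentioned in the abstract can be illustrated with a minimal sketch: flip a fixed fraction of the training labels, retrain, and measure accuracy against the clean labels. Everything below is an assumption made for illustration and is not the authors' implementation: the toy texts, the names `flip_labels`, `train_centroids`, `predict` and `accuracy`, and the simple word-frequency centroid classifier, which stands in for the CNN, SVM, Random Forest and Logistic classifiers compared in the paper.

```python
import random
from collections import Counter

# Hypothetical toy records: one short text per website, label 1 = e-commerce.
TEXTS = [
    "add to cart checkout payment",
    "shopping cart order payment shipping",
    "buy online checkout shipping",
    "company history mission contact",
    "about us contact news",
    "press news mission team",
]
LABELS = [1, 1, 1, 0, 0, 0]


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of binary labels with round(noise_rate * n) of them flipped."""
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flipped else y for i, y in enumerate(labels)]


def train_centroids(texts, labels):
    """Build one relative word-frequency profile per class (0 and 1)."""
    totals = {0: Counter(), 1: Counter()}
    for text, y in zip(texts, labels):
        totals[y].update(text.lower().split())
    centroids = {}
    for y, counts in totals.items():
        size = sum(counts.values()) or 1  # avoid division by zero for an empty class
        centroids[y] = {word: c / size for word, c in counts.items()}
    return centroids


def predict(centroids, text):
    """Assign the class whose profile gives the words of `text` the highest total weight."""
    words = text.lower().split()
    scores = {y: sum(c.get(w, 0.0) for w in words) for y, c in centroids.items()}
    return max(scores, key=scores.get)


def accuracy(centroids, texts, labels):
    """Fraction of texts whose predicted class matches the given label."""
    hits = sum(predict(centroids, t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


# Train on increasingly noisy labels, always evaluate against the clean labels.
for rate in (0.0, 0.2, 0.5):
    noisy = flip_labels(LABELS, rate, seed=1)
    model = train_centroids(TEXTS, noisy)
    print(f"noise rate {rate:.1f}: accuracy {accuracy(model, TEXTS, LABELS):.2f}")
```

In a realistic experiment the evaluation would use a held-out test set with verified labels rather than the training texts, and the noise rates would be swept for each classifier under comparison.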

Website categorization: A formal approach and robustness analysis in the case of e-commerce detection / Bruni, R.; Bianchi, G.. - In: EXPERT SYSTEMS WITH APPLICATIONS. - ISSN 0957-4174. - 142:(2020). [10.1016/j.eswa.2019.113001]

Website categorization: A formal approach and robustness analysis in the case of e-commerce detection

Bruni R. (Methodology); Bianchi G.

2020

	Year of publication
				2020
	Keywords
				Classification; E-commerce; Feature engineering; Machine learning; Surveys; Text mining
	Type
				01 Journal publication::01a Journal article
	Citation
				Website categorization: A formal approach and robustness analysis in the case of e-commerce detection / Bruni, R.; Bianchi, G.. - In: EXPERT SYSTEMS WITH APPLICATIONS. - ISSN 0957-4174. - 142:(2020). [10.1016/j.eswa.2019.113001]
	Belongs to type:
				01a Journal article

Files attached to this product

File	Size	Format
Bruni_Preprint_Website-categorization_2020.pdf (open access; Note: https://www.sciencedirect.com/science/article/pii/S0957417419307183?via=ihub; Type: Pre-print document (manuscript submitted to the publisher, prior to peer review); License: All rights reserved)	909.11 kB	Adobe PDF
Bruni_Website-categorization_2020.pdf (archive managers only; Type: Publisher's version (published version with the publisher's layout); License: All rights reserved)	1.31 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1328415
