Website categorization has recently emerged as an important task in several contexts. A huge amount of information is freely available through websites, and it could be used, for example, to carry out statistical surveys at reduced cost. However, the information relevant to a specific categorization has to be mined from that huge amount, which turns out to be difficult in practice. In this work we propose a practically viable procedure for website categorization, based on the automatic generation of data records summarizing the content of each entire website. These records are obtained by web scraping and optical character recognition, followed by a number of nontrivial text mining and feature engineering steps. Once the records have been produced, we use classification algorithms to categorize the websites according to the aspect of interest. For this task we compare Convolutional Neural Networks, Support Vector Machines, Random Forest and Logistic classifiers. Since in many practical cases the training set labels are inherently noisy, we analyze the robustness of each technique to the presence of misclassified training records. We present results on real-world data for the problem of detecting websites that provide e-commerce facilities; however, our approach is not structurally limited to this case.
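The robustness analysis mentioned in the abstract requires training sets whose labels are partly wrong. The abstract does not say how the misclassified records were produced, so the following is only a minimal sketch under an assumed scheme: binary labels (1 = offers e-commerce, 0 = does not) with a fixed fraction flipped uniformly at random. The function name `flip_labels`, its parameters, and the sample labels are hypothetical and are not taken from the paper.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of binary labels with a fraction `noise_rate` flipped.

    Assumption: noise is uniform, i.e. every record is equally likely
    to be mislabeled, regardless of its true class.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)  # local generator, so runs are repeatable
    n_flip = round(noise_rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


# Hypothetical training labels: 1 = offers e-commerce, 0 = does not.
clean = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
for rate in (0.0, 0.1, 0.3):
    noisy = flip_labels(clean, rate, seed=42)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print(f"noise_rate={rate:.1f} -> {changed} of {len(clean)} labels flipped")
```

In a study of this kind, each classifier would be retrained on the perturbed labels at every noise level and scored on an unperturbed test set, so that the drop in accuracy can be compared across techniques.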

Website categorization: A formal approach and robustness analysis in the case of e-commerce detection / Bruni, R.; Bianchi, G. - In: EXPERT SYSTEMS WITH APPLICATIONS. - ISSN 0957-4174. - 142:(2020). [10.1016/j.eswa.2019.113001]

Website categorization: A formal approach and robustness analysis in the case of e-commerce detection

Bruni R. (Methodology); Bianchi G.
2020

Keywords: Classification; E-commerce; Feature engineering; Machine learning; Surveys; Text mining
01 Journal publication::01a Journal article
Files attached to this item

Bruni_Preprint_Website-categorization_2020.pdf
Access: open access
Note: https://www.sciencedirect.com/science/article/pii/S0957417419307183?via=ihub
Type: Pre-print (manuscript submitted to the publisher, prior to peer review)
License: All rights reserved
Size: 909.11 kB
Format: Adobe PDF

Bruni_Website-categorization_2020.pdf
Access: archive administrators only
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 1.31 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1328415
Citations
  • PMC: not available
  • Scopus: 27
  • Web of Science (ISI): 18