Website categorization has recently emerged as a very important task in several contexts. A huge amount of information is freely available through websites, and it could be used for example to accomplish statistical surveys, saving in costs. However, the information of interest for the specific categorization has to be mined among that huge amount. This turns out to be a difficult task in practice. In this work we propose a practically viable procedure to perform website categorization, based on the automatic generation of data records summarizing the content of each entire website. This is obtained by using web scraping and optical character recognition, followed by a number of nontrivial text mining and feature engineering steps. When such records have been produced, we use classification algorithms to categorize the websites according to the aspect of interest. We compare in this task Convolutional Neural Networks, Support Vector Machines, Random Forest and Logistic classifiers. Since in many practical cases the training set labels are physiologically noisy, we analyze the robustness of each technique with respect to the presence of misclassified training records. We present results on real-world data for the problem of the detection of websites providing e-commerce facilities, however our approach is not structurally limited to this case.
Website categorization: A formal approach and robustness analysis in the case of e-commerce detection / Bruni, R.; Bianchi, G.. - In: EXPERT SYSTEMS WITH APPLICATIONS. - ISSN 0957-4174. - 142:(2020). [10.1016/j.eswa.2019.113001]
Website categorization: A formal approach and robustness analysis in the case of e-commerce detection
Bruni R.
Methodology
;Bianchi G.
2020
Abstract
Website categorization has recently emerged as a very important task in several contexts. A huge amount of information is freely available through websites, and it could be used for example to accomplish statistical surveys, saving in costs. However, the information of interest for the specific categorization has to be mined among that huge amount. This turns out to be a difficult task in practice. In this work we propose a practically viable procedure to perform website categorization, based on the automatic generation of data records summarizing the content of each entire website. This is obtained by using web scraping and optical character recognition, followed by a number of nontrivial text mining and feature engineering steps. When such records have been produced, we use classification algorithms to categorize the websites according to the aspect of interest. We compare in this task Convolutional Neural Networks, Support Vector Machines, Random Forest and Logistic classifiers. Since in many practical cases the training set labels are physiologically noisy, we analyze the robustness of each technique with respect to the presence of misclassified training records. We present results on real-world data for the problem of the detection of websites providing e-commerce facilities, however our approach is not structurally limited to this case.File | Dimensione | Formato | |
---|---|---|---|
Bruni_Preprint_Website-categorization_2020.pdf
accesso aperto
Note: https://www.sciencedirect.com/science/article/pii/S0957417419307183?via=ihub
Tipologia:
Documento in Pre-print (manoscritto inviato all'editore, precedente alla peer review)
Licenza:
Tutti i diritti riservati (All rights reserved)
Dimensione
909.11 kB
Formato
Adobe PDF
|
909.11 kB | Adobe PDF | |
Bruni_Website-categorization_2020.pdf
solo gestori archivio
Tipologia:
Versione editoriale (versione pubblicata con il layout dell'editore)
Licenza:
Tutti i diritti riservati (All rights reserved)
Dimensione
1.31 MB
Formato
Adobe PDF
|
1.31 MB | Adobe PDF | Contatta l'autore |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.