Text clustering methods allow the automatic classification of large sets of documents. Once a corpus has been transformed from unstructured text into structured data, many algorithms originally proposed for structured data can be applied. However, the transformed corpus is high-dimensional and its clusters overlap, which can make the cluster descriptions hard to understand. In this paper, we introduce a new method for detecting cluster centroids. Centroids represent prototypes of mutually exclusive partitions and therefore make it easier to interpret the results and describe the groups. In this approach, after the preprocessing step, we establish links between documents using co-occurrence information within some lexical units. We then use centrality measures to weight the texts and classify the documents. We analyze 1,650 job announcements published from January 1st, 2010 to April 5th, 2011 by 496 companies on DB SOUL (System University Orientation and Job).
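The pipeline outlined in the abstract (link documents by co-occurrence, score them by centrality, use the most central documents as centroids) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' method: the link weight is the number of shared lexical units, the centrality measure is weighted degree, centroids are picked greedily among documents not linked to an already chosen centroid, and the function names and toy documents are hypothetical.

```python
from itertools import combinations


def cooccurrence_links(docs):
    """Link two documents with a weight equal to the number of lexical units they share."""
    vocab = [set(tokens) for tokens in docs]
    links = {}
    for i, j in combinations(range(len(docs)), 2):
        shared = len(vocab[i] & vocab[j])
        if shared:
            links[(i, j)] = shared
    return links


def weighted_degree(n_docs, links):
    """Weighted degree centrality: sum of the link weights incident to each document."""
    centrality = [0] * n_docs
    for (i, j), weight in links.items():
        centrality[i] += weight
        centrality[j] += weight
    return centrality


def link_weight(links, i, j):
    """Weight of the link between documents i and j (0 if they are not linked)."""
    return links.get((min(i, j), max(i, j)), 0)


def cluster_by_centrality(docs, k):
    """Pick up to k centroids by centrality, then assign each document to one centroid."""
    links = cooccurrence_links(docs)
    centrality = weighted_degree(len(docs), links)
    # Rank documents from most to least central; ties go to the lower index.
    ranked = sorted(range(len(docs)), key=lambda d: (-centrality[d], d))
    centroids = []
    for d in ranked:
        # Assumed rule: a centroid must not be linked to an already chosen centroid.
        if all(link_weight(links, d, c) == 0 for c in centroids):
            centroids.append(d)
        if len(centroids) == k:
            break
    labels = []
    for d in range(len(docs)):
        if d in centroids:
            labels.append(d)
        else:
            # Strongest link wins; with no link at all, the lowest-index centroid is used.
            labels.append(max(centroids, key=lambda c: (link_weight(links, d, c), -c)))
    return centroids, labels


# Toy, already preprocessed "job announcements" (hypothetical data).
docs = [
    ["java", "developer", "software"],
    ["software", "engineer", "java"],
    ["developer", "java", "web"],
    ["sales", "manager", "retail"],
    ["retail", "sales", "assistant"],
    ["sales", "retail", "store"],
]
centroids, labels = cluster_by_centrality(docs, k=2)
print(centroids, labels)
```

Because each document receives exactly one centroid label, the resulting partition is mutually exclusive, which is the property the abstract relies on for interpretability. Other centrality measures (for example eigenvector or betweenness centrality) could be substituted for the weighted degree used here.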
Text clustering based on centrality measures: an application on job advertisements / Iezzi, Domenica Fioredistella; Mastrangelo, Mario; Sarlo, Scipione. - Print. - (2012), pp. 515-524. (Paper presented at the 11th International Conference on Textual Data Statistical Analysis, held in Liège, 13-15 June 2012).
Text clustering based on centrality measures: an application on job advertisements
MASTRANGELO, MARIO
2012