Research Products Catalogue

It is well known in the literature that document clustering techniques have two main limitations: they operate in a high-dimensional space, and the resulting clusters are difficult to interpret once a partition has been obtained. The proposed methods for document clustering follow a two-stage process. Because the document-term matrix is highly sparse, applying a clustering technique to it directly would be highly inefficient, so dimensionality reduction is applied first: latent Dirichlet allocation (LDA) is used to identify the main topics in the corpus under analysis. To determine the similarity between two documents, the p-value of a hypothesis test of the homogeneity of their topic distributions is computed. This p-value is used as a similarity measure, upon which three different clustering procedures are built. The first two use the resulting measure directly, in a hierarchical approach and a fuzzy relational clustering approach, while the third is a test-based approach to clustering. The performance of the clustering methods is then assessed on benchmark datasets in order to understand the advantages and disadvantages of the proposals.
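The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify which homogeneity test is used, so everything below is an assumption made for illustration: the LDA step is replaced by hypothetical per-token topic labels, the test is a permutation test on a chi-square statistic, the dissimilarity is taken as 1 minus the p-value, and the clustering is a naive average-linkage procedure. The function names (`chi2_stat`, `homogeneity_pvalue`, `average_linkage`) are invented here and are not the authors' implementation.

```python
import random
from collections import Counter
from itertools import combinations


def chi2_stat(a, b, n_topics):
    """Chi-square homogeneity statistic for two lists of topic labels."""
    ca, cb = Counter(a), Counter(b)
    na, nb = len(a), len(b)
    stat = 0.0
    for k in range(n_topics):
        total = ca[k] + cb[k]
        if total == 0:
            continue  # topic absent from both documents
        for obs, n in ((ca[k], na), (cb[k], nb)):
            expected = total * n / (na + nb)
            stat += (obs - expected) ** 2 / expected
    return stat


def homogeneity_pvalue(a, b, n_topics, n_perm=500, seed=0):
    """Permutation p-value for H0: both documents share one topic distribution."""
    rng = random.Random(seed)
    observed = chi2_stat(a, b, n_topics)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign tokens to documents at random
        if chi2_stat(pooled[:len(a)], pooled[len(a):], n_topics) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def average_linkage(dissim, n_clusters):
    """Naive agglomerative clustering on a symmetric dissimilarity matrix."""
    clusters = [[i] for i in range(len(dissim))]
    while len(clusters) > n_clusters:
        def dist(pair):
            x, y = clusters[pair[0]], clusters[pair[1]]
            return sum(dissim[i][j] for i in x for j in y) / (len(x) * len(y))
        i, j = min(combinations(range(len(clusters)), 2), key=dist)
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return [sorted(c) for c in clusters]


# Hypothetical topic labels per token (stand-in for the output of a fitted LDA model).
docs = [
    [0] * 30 + [1] * 8 + [2] * 2,
    [0] * 28 + [1] * 9 + [2] * 3,
    [0] * 3 + [1] * 7 + [2] * 30,
    [0] * 2 + [1] * 10 + [2] * 28,
]

n = len(docs)
dissim = [[0.0] * n for _ in range(n)]
for i, j in combinations(range(n), 2):
    p = homogeneity_pvalue(docs[i], docs[j], n_topics=3)
    dissim[i][j] = dissim[j][i] = 1.0 - p  # assumed mapping from similarity to dissimilarity

clusters = average_linkage(dissim, n_clusters=2)
print(clusters)
```

In this toy setting, a large p-value means the test finds no evidence against a shared topic distribution, so the two documents are treated as close. The fuzzy relational and test-based procedures mentioned in the abstract are not sketched here.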

Hypothesis test based document clustering / Sangiovanni, Gian Mario; Kontoghiorghes, Louisa; Colubi, Ana; Ferraro, Maria Brigida. - (2024), p. 47. (Contribution presented at the 18th International Joint Conference CFE-CMStatistics, held in London, United Kingdom).

Hypothesis test based document clustering

Gian Mario Sangiovanni (first author); Louisa Kontoghiorghes (second author); Ana Colubi (penultimate author); Maria Brigida Ferraro (last author)

2024

Full record

Year of publication: 2024
Conference name: 18th International Joint Conference CFE-CMStatistics
Type: 04 Publication in conference proceedings::04d Abstract in conference proceedings

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1730572

Warning: the data displayed have not been validated by the university.
