Topic Homogeneity Test-Based Fuzzy Document Clustering / Sangiovanni, Gian Mario; Kontoghiorghes, Louisa; Colubi, Ana; Ferraro, Maria Brigida. - (2025), pp. 294-303. - STUDIES IN CLASSIFICATION, DATA ANALYSIS, AND KNOWLEDGE ORGANIZATION. [10.1007/978-3-032-03042-9].
Topic Homogeneity Test-Based Fuzzy Document Clustering
Gian Mario Sangiovanni; Louisa Kontoghiorghes; Ana Colubi; Maria Brigida Ferraro
2025
Abstract
A new fuzzy document clustering algorithm based on topic homogeneity is introduced. In detail, a novel dissimilarity measure is proposed, derived from the $p$-value of a hypothesis test that assesses the homogeneity of topic distributions between two documents. First, the topic distributions are derived through Latent Dirichlet Allocation, and then a bootstrap procedure is applied to obtain the $p$-value. Finally, the resulting dissimilarity matrix is integrated into the fuzzy relational clustering procedure. The performance of the proposal is evaluated using a benchmark dataset.
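The step from topic distributions to a dissimilarity matrix can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not specify the test statistic, the bootstrap scheme, or how the $p$-value is mapped to a dissimilarity. The sketch assumes that each document is summarised by per-topic token counts (for example, obtained from a fitted Latent Dirichlet Allocation model), that the statistic is the total variation distance between empirical topic proportions, that resampling is a parametric bootstrap from the pooled proportions, and that the dissimilarity is one minus the $p$-value. The function names `homogeneity_pvalue` and `dissimilarity_matrix` are hypothetical.

```python
import numpy as np


def homogeneity_pvalue(counts_a, counts_b, n_boot=999, rng=None):
    """Bootstrap p-value for H0: both documents share one topic distribution.

    Assumptions (not taken from the paper): total variation distance as the
    statistic, parametric bootstrap from the pooled topic proportions.
    """
    rng = np.random.default_rng(rng)
    a = np.asarray(counts_a, dtype=float)
    b = np.asarray(counts_b, dtype=float)
    n_a, n_b = int(a.sum()), int(b.sum())

    # Observed distance between the two empirical topic distributions.
    observed = 0.5 * np.abs(a / n_a - b / n_b).sum()

    # Under H0 both documents are drawn from the pooled distribution.
    pooled = (a + b) / (n_a + n_b)
    boot_a = rng.multinomial(n_a, pooled, size=n_boot)
    boot_b = rng.multinomial(n_b, pooled, size=n_boot)
    boot = 0.5 * np.abs(boot_a / n_a - boot_b / n_b).sum(axis=1)

    # Add-one correction keeps the p-value strictly positive.
    return (1 + np.sum(boot >= observed)) / (n_boot + 1)


def dissimilarity_matrix(counts, n_boot=999, rng=None):
    """Pairwise dissimilarities defined here as 1 - p-value (an assumption)."""
    rng = np.random.default_rng(rng)
    counts = np.asarray(counts)
    n_docs = counts.shape[0]
    d = np.zeros((n_docs, n_docs))
    for i in range(n_docs):
        for j in range(i + 1, n_docs):
            p = homogeneity_pvalue(counts[i], counts[j], n_boot, rng)
            d[i, j] = d[j, i] = 1.0 - p
    return d


# Toy example: documents 0 and 1 share a dominant topic, document 2 does not.
toy_counts = [[50, 5, 5], [48, 6, 6], [5, 5, 50]]
D = dissimilarity_matrix(toy_counts, n_boot=499, rng=0)
print(np.round(D, 3))
```

A symmetric matrix with a zero diagonal, such as `D`, is the kind of input that a fuzzy relational clustering routine takes, since such methods operate on pairwise dissimilarities rather than on feature vectors.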


