PROCSIMA: probability distribution clustering using similarity matrix analysis

Ortu, Marco
2024-01-01

Abstract

This study presents PROCSIMA, a methodological approach to document clustering that defines a similarity metric, derived from the Jensen-Shannon divergence, for measuring similarity between topic probability distributions obtained from Topic Modeling techniques such as Latent Dirichlet Allocation (LDA). Unlike conventional approaches that assign each document to its single most pertinent topic, PROCSIMA clusters documents by considering their full topic distribution. It transforms the similarity matrix into an adjacency matrix and then applies community detection algorithms to define document clusters. PROCSIMA is validated empirically on both synthetic and real-world datasets, bootstrapping the optimal number of network communities, and outperforms traditional clustering methods.
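
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not give the exact form of the similarity metric, the rule for building the adjacency matrix, or the community detection algorithm. The sketch therefore assumes a similarity of 1 minus the base-2 Jensen-Shannon divergence, a fixed threshold (0.9, chosen arbitrarily) for the adjacency matrix, and connected components as a deliberately simple stand-in for community detection. The toy topic distributions and all function names are hypothetical, and the bootstrap step is not shown.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def similarity_matrix(dists):
    """Assumed similarity: 1 - JSD between every pair of topic distributions."""
    n = len(dists)
    return [[1.0 - js_divergence(dists[i], dists[j]) for j in range(n)]
            for i in range(n)]


def adjacency_matrix(sim, threshold):
    """Assumed rule: connect two documents when similarity >= threshold."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= threshold else 0 for j in range(n)]
            for i in range(n)]


def connected_components(adj):
    """Stand-in for community detection: label each connected component."""
    n = len(adj)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if adj[u][v] and labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


# Hypothetical document-topic distributions (rows sum to 1), e.g. LDA output.
doc_topics = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.15, 0.75],
    [0.05, 0.20, 0.75],
]

sim = similarity_matrix(doc_topics)
adj = adjacency_matrix(sim, threshold=0.9)
labels = connected_components(adj)
print(labels)
```

In this toy example the first two documents share one label and the last two share another, because each pair has nearly identical topic distributions. In practice a modularity-based community detection algorithm would replace `connected_components`, and the number of communities would be chosen by bootstrapping, as the abstract states.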
2024
English
Proceedings of the Statistics and Data Science 2024 Conference - New perspectives on Statistics and Data Science
9788855096454
Palermo University Press
Palermo
ITALY
Michele La Rocca, et al.
Antonella Plaia, Leonardo Egidi, Antonino Abbruzzo
6
Statistics and Data Science Conference 2024
Anonymous experts
11-12 April 2024
Palermo
scientific
Document Clustering; Topic Modeling; Textual Similarity Metric
no
4 Contribution in Conference Proceedings (Proceeding)::4.1 Contribution in conference proceedings
Ortu, Marco
273
1
4.1 Contribution in conference proceedings
reserved
info:eu-repo/semantics/conferencePaper
Files in This Item:
File Size Format  
SDS_2024.pdf

Archive managers only

Size 2.93 MB
Format Adobe PDF

