Mohamad El Mehtedi

PROCSIMA: probability distribution clustering using similarity matrix analysis

Ortu, Marco

2024-01-01

Abstract

This study presents PROCSIMA, a methodological approach to document clustering that defines a similarity metric, derived from the Jensen-Shannon divergence, for measuring the similarity between topic probability distributions obtained from topic modeling techniques such as Latent Dirichlet Allocation (LDA). Unlike conventional approaches that assign each document to its single most pertinent topic, PROCSIMA clusters documents by considering their full topic distribution. It transforms the similarity matrix into an adjacency matrix and then applies community detection algorithms to define the document clusters. PROCSIMA is validated empirically on both synthetic and real-world datasets, using bootstrapping to determine the optimal number of network communities, and in these experiments it outperforms traditional clustering methods.
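
The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact similarity formula, the rule for building the adjacency matrix, or the community detection algorithm, so everything below is an assumption made for illustration: similarity is taken as 1 minus the base-2 Jensen-Shannon divergence, the adjacency matrix is obtained with a hypothetical fixed threshold, and connected components are used as a crude stand-in for a real community detection algorithm. The function names and the toy topic distributions are likewise hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute 0 by convention.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def similarity_matrix(dists):
    """Assumed similarity: 1 - JSD (not necessarily the paper's metric)."""
    n = len(dists)
    return [[1.0 - js_divergence(dists[i], dists[j]) for j in range(n)]
            for i in range(n)]


def adjacency_matrix(sim, threshold):
    """Assumed rule: link two documents when similarity >= threshold."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= threshold else 0 for j in range(n)]
            for i in range(n)]


def connected_components(adj):
    """Stand-in for community detection: label each connected component."""
    n = len(adj)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in range(n):
                if adj[node][neighbour] and labels[neighbour] == -1:
                    labels[neighbour] = current
                    stack.append(neighbour)
        current += 1
    return labels


# Toy per-document topic distributions (e.g. as an LDA model might produce).
doc_topics = [
    [0.80, 0.15, 0.05],
    [0.75, 0.20, 0.05],
    [0.05, 0.15, 0.80],
    [0.10, 0.10, 0.80],
]

sim = similarity_matrix(doc_topics)
adj = adjacency_matrix(sim, threshold=0.9)
labels = connected_components(adj)
print(labels)  # [0, 0, 1, 1]
```

In this toy case documents 0 and 1 share nearly the same topic mix and end up in one group, while documents 2 and 3 form another. A realistic version would likely replace the connected-components step with a modularity-based community detection method and choose the number of communities by resampling, as the abstract indicates.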

Year: 2024
Language(s): English
Volume title: Proceedings of the Statistics and Data Science 2024 Conference - New perspectives on Statistics and Data Science
ISBN: 9788855096454
Publisher: Palermo University Press
Publisher city: Palermo
Publisher country: Italy
Volume author(s): Michele La Rocca, et al.
Volume editor(s): Antonella Plaia, Leonardo Egidi, Antonino Abbruzzo
Number of pages: 6
Conference title: Statistics and Data Science Conference 2024
Referees: Anonymous experts
Conference dates: 11-12 April 2024
Conference location: Palermo
Prevailing characterization: scientific
Keywords: Document Clustering; Topic Modeling; Textual Similarity Metric
International co-authors: no
Type: 4 Contribution in Conference Proceedings (Proceeding)::4.1 Contribution in conference proceedings
All authors: Ortu, Marco
Faculty website type: 273
Number of authors: 1
Full text: reserved
Type: info:eu-repo/semantics/conferencePaper

Files in This Item:

File	Size	Format
SDS_2024.pdf (archive managers only)	2.93 MB	Adobe PDF

University of Cagliari