Generalized Structured Component Analysis for Topic Modeling

Ortu, Marco
2025-01-01

Abstract

This paper presents a novel methodological framework for discovering and analyzing topic relationships in document collections using Generalized Structured Component Analysis (GSCA). While traditional document clustering approaches often rely on dominant topic assignment or simple similarity measures, our method leverages the full probabilistic nature of topic distributions and uncovers complex structural relationships between topics. We propose an unsupervised approach that starts with a fully connected path matrix and systematically identifies significant relationships through a rigorous statistical procedure combining bootstrap-based validation, regularization, and out-of-bag prediction error assessment. The method accommodates optional covariates and provides a robust alternative to conventional clustering techniques. Our framework contributes to both the theoretical understanding of topic relationships and practical applications in document organization and knowledge discovery.
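
The abstract names the ingredients of the procedure without giving details. As a rough illustration only, and not the author's implementation, the sketch below shows one way a fully connected path matrix, a regularized fit, bootstrap validation, and out-of-bag error could fit together. It rests on strong simplifying assumptions: each topic is treated as a single observed component (so the GSCA weight-estimation step is skipped), regularization is a plain ridge penalty, a path is kept when its bootstrap percentile interval excludes zero, and synthetic unconstrained scores stand in for document-topic proportions. The names `ridge_paths`, `select_paths`, `ridge_lambda`, `n_boot`, and `alpha` are hypothetical.

```python
import numpy as np

def ridge_paths(X, ridge_lambda):
    """Fully connected path matrix B, where B[i, j] is the path from
    topic i to topic j, estimated by ridge regression of topic j on all
    other topics. The diagonal is fixed at zero (no self-paths)."""
    n, k = X.shape
    B = np.zeros((k, k))
    for j in range(k):
        others = [i for i in range(k) if i != j]
        Z = X[:, others]
        A = Z.T @ Z + ridge_lambda * np.eye(k - 1)
        B[others, j] = np.linalg.solve(A, Z.T @ X[:, j])
    return B

def select_paths(X, n_boot=200, ridge_lambda=1.0, alpha=0.05, seed=0):
    """Bootstrap the path matrix; keep paths whose percentile interval
    excludes zero and report the mean out-of-bag squared error.
    (Assumed selection rule, not taken from the paper.)"""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    X = X - X.mean(axis=0)                      # center each topic score
    boots = np.empty((n_boot, k, k))
    oob_errors = []
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample documents with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # documents left out of this resample
        B = ridge_paths(X[idx], ridge_lambda)
        boots[b] = B
        if oob.size:
            resid = X[oob] - X[oob] @ B         # predict each topic from the others
            oob_errors.append(np.mean(resid ** 2))
    lo = np.percentile(boots, 100 * alpha / 2, axis=0)
    hi = np.percentile(boots, 100 * (1 - alpha / 2), axis=0)
    keep = (lo > 0) | (hi < 0)                  # interval excludes zero
    return keep, boots.mean(axis=0), float(np.mean(oob_errors))

# Synthetic stand-in data: topic 1 depends on topic 0, topic 2 is independent.
rng = np.random.default_rng(1)
n_docs = 300
t0 = rng.normal(size=n_docs)
t1 = 0.8 * t0 + 0.3 * rng.normal(size=n_docs)
t2 = rng.normal(size=n_docs)
X = np.column_stack([t0, t1, t2])

keep, B_mean, oob_mse = select_paths(X)
print(keep.astype(int))
print(round(oob_mse, 3))
```

In this sketch the boolean matrix `keep` plays the role of the pruned path matrix, and the out-of-bag error gives a resampling-based check that the retained structure predicts documents not used in fitting. Real topic proportions sum to one per document, so they would need a suitable transformation before a fit of this kind.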
2025
English
Supervised and Unsupervised Statistical Data Analysis
978-3-032-03041-2
341
CLADAG 2025
Scientific committee
8-10 September
Napoli
scientific
Structural Topic Modeling, General Component Analysis, Document Clustering
no
4 Contribution in Conference Proceedings (Proceeding)::4.1 Contribution in conference proceedings
273
1
reserved
info:eu-repo/semantics/conferencePaper
Files in this record:
File: _CLADAG_25__GSCA_and_Topic_Modeling_Document_Clustering.pdf
Access: archive managers only
Type: pre-print version
Size: 193.81 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
