LLM-Based Auto-Categorization of Android Applications

Pierangelo Loi (first author); Diego Soi; Leonardo Regano; Giorgio Giacinto
2026-01-01

Abstract

Many anomaly-based malware detectors implicitly depend on the automatic categorisation of Android apps. Such tools first group apps by their declared functionality and then learn what constitutes "normal" use of sensitive APIs within each group. If this categorisation is wrong or overly coarse, the entire pipeline suffers: benign apps can be flagged as malicious, while malicious or grayware apps can hide among poorly matched neighbours. This highlights the need for precise, robust, and scalable categorisation methods that operate directly on app stores’ metadata and can be plugged into security pipelines. In this paper, we present a free and fully automatic description-based framework for fine-grained categorisation of Android applications, explicitly designed as a drop-in replacement for the categorisation stage in anomaly-based systems. Our pipeline embeds Google Play Store descriptions with a sentence-transformer model, reduces the embeddings with Uniform Manifold Approximation and Projection, and clusters them using K-means while automatically selecting the number of clusters via the Mean Silhouette Coefficient. Finally, it employs a lightweight Large Language Model to generate concise, human-readable labels for each discovered cluster. We evaluate our approach on AndroCatSet, a manually curated ground-truth dataset of 5000 benign apps organised into 50 fine-grained classes. The resulting categorisation component yields semantically coherent and interpretable functional groups, which can be readily integrated into security pipelines to strengthen the detection of miscategorised, malicious, and grayware apps whose actual behaviour diverges from their declared purpose.
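The cluster-count selection step mentioned in the abstract (K-means with the number of clusters chosen by the Mean Silhouette Coefficient) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the sentence-transformer embedding and UMAP reduction stages are omitted, a small set of toy 2-D points stands in for the reduced description embeddings, and the function names (`farthest_point_init`, `kmeans`, `mean_silhouette`, `select_k`) as well as the deterministic farthest-point initialisation are assumptions made only for this example.

```python
import math

def farthest_point_init(points, k):
    # Deterministic initialisation (an assumption of this sketch):
    # start from the first point, then repeatedly add the point that is
    # farthest from its nearest already-chosen centroid.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iter=100):
    # Plain Lloyd iterations: assign each point to its nearest centroid,
    # then move each centroid to the mean of its members.
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels

def mean_silhouette(points, labels):
    # Silhouette of a point: (b - a) / max(a, b), where a is the mean
    # distance to the other members of its own cluster and b is the
    # smallest mean distance to the members of any other cluster.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

def select_k(points, candidate_ks):
    # Run K-means for every candidate k and keep the clustering with the
    # highest mean silhouette coefficient.
    best_k, best_score, best_labels = None, -2.0, None
    for k in candidate_ks:
        labels = kmeans(points, k)
        score = mean_silhouette(points, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_score, best_labels

# Toy stand-in for reduced embeddings: three tight, well-separated groups.
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
offsets = (-0.4, 0.0, 0.4)
points = [(cx + dx, cy + dy) for cx, cy in centers for dx in offsets for dy in offsets]

best_k, best_score, best_labels = select_k(points, range(2, 7))
print(best_k, round(best_score, 3))
```

In a real pipeline, `points` would be replaced by the reduced description embeddings, and a library implementation of K-means and the silhouette score would normally be preferred over this hand-written version.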
2026
English
ITASEC & SERICS 2026 Joint National Conference on Cybersecurity 2026. Proceedings of the Joint National Conference on Cybersecurity (ITASEC & SERICS 2026) Cagliari, Italy, February 09-13, 2026
4198
11
https://ceur-ws.org/Vol-4198/
https://ceur-ws.org/Vol-4198/paper33.pdf
ITASEC & SERICS 2026 Joint National Conference on Cybersecurity 2026
Scientific committee
February 09-13, 2026
Cagliari, Italy
scientific
Android; App Categorisation; LLM; Clustering; Embeddings
no
4 Contribution in Conference Proceedings (Proceeding)::4.1 Contribution in conference proceedings
Loi, Pierangelo; Soi, Diego; Regano, Leonardo; Giacinto, Giorgio
273
4
4.1 Contribution in conference proceedings
open
info:eu-repo/semantics/conferencePaper
Files in This Item:
File: LLM-Based Auto-Categorization of Android Applications.pdf
Access: open access
Type: publisher's version
Size: 1.23 MB
Format: Adobe PDF

