Giovanna Maria Ghiani

Exposing the Cracks: A Case Study on the Quality of Public Linux Malware Data Sets

Alessandro Sanna
First
;
Leonardo Regano; Davide Maiorca
Penultimate
;
Giorgio Giacinto
Last
2025-01-01

Abstract

Machine learning (ML) is extensively used for malware detection due to its accuracy, scalability, and adaptability. However, the effectiveness of ML models depends heavily on the quality of the datasets used for training and testing. This study evaluates popular public datasets of malicious ARM ELF binaries from MalwareBazaar and VirusShare, complemented with benign binaries from Debian repositories. Using mnemonic frequency analysis, we found that these datasets lack the diversity found in Android or Windows malware datasets. Using only the frequency of a single assembly mnemonic, we can distinguish malware from goodware with a balanced accuracy of 78%; using three mnemonics, we achieve a balanced accuracy of 99%. Finally, we draw conclusions on the current state of publicly available Linux malware.
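
To make the idea of separating samples by the frequency of a single mnemonic concrete, the following is a minimal sketch and not the authors' pipeline. It assumes each binary has already been disassembled into a list of mnemonic strings; the target mnemonic (`ldr`), the threshold (0.4), the function names, and the four toy samples are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def mnemonic_frequency(mnemonics, target):
    """Relative frequency of `target` among a binary's disassembled mnemonics."""
    if not mnemonics:
        return 0.0
    return Counter(mnemonics)[target] / len(mnemonics)


def classify(samples, target, threshold):
    """Label a sample 1 (malware) when the target frequency reaches the threshold."""
    return [1 if mnemonic_frequency(m, target) >= threshold else 0 for m in samples]


def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical toy data: mnemonic sequences and labels (1 = malware, 0 = goodware).
samples = [
    ["ldr", "str", "bl", "ldr"],
    ["ldr", "ldr", "mov", "ldr"],
    ["mov", "add", "bl", "ldr"],
    ["mov", "add", "sub", "cmp"],
]
labels = [1, 1, 0, 0]

# Hypothetical target mnemonic and threshold, picked by hand for this toy data.
predictions = classify(samples, target="ldr", threshold=0.4)
score = balanced_accuracy(labels, predictions)
print(predictions, score)
```

Balanced accuracy averages the per-class recall, so it stays informative when the malware and goodware sets differ in size. A classifier this simple performing well would suggest that the two classes differ in a trivially detectable way, which is the data set quality concern the abstract raises.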
2025
English
Proceedings of the Joint National Conference on Cybersecurity (ITASEC & SERICS 2025)
3962
1
10
10
Joint National Conference on Cybersecurity (ITASEC & SERICS 2025)
Scientific committee
3-8 February 2025
Bologna
scientific
Malware Analysis; Data Set Quality; Binary Analysis; Linux Malware; Machine Learning; Assembly Mnemonics
no
4 Contribution in Conference Proceedings (Proceeding)::4.1 Contribution in conference proceedings
Sanna, Alessandro; Canavese, Daniele; Regano, Leonardo; Maiorca, Davide; Giacinto, Giorgio
273
5
4.1 Contribution in conference proceedings
open
info:eu-repo/semantics/conferencePaper
Files in this product:
File: paper33.pdf
Access: open access
Description: Publisher's version
Type: publisher's version (VoR)
Size: 1.08 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.