Riccardo Cicilloni
Exposing the Cracks: A Case Study on the Quality of Public Linux Malware Data Sets
Alessandro Sanna; Leonardo Regano; Davide Maiorca; Giorgio Giacinto
2025-01-01
Abstract
Machine learning (ML) is extensively used for malware detection due to its accuracy, scalability, and adaptability. However, the effectiveness of ML models depends heavily on the quality of the datasets used for training and testing. This study evaluates popular public datasets of malicious ARM ELF binaries from MalwareBazaar and VirusShare, complemented with benign binaries from Debian repositories. Using mnemonic frequency analysis, we find that these datasets lack the diversity found in Android or Windows malware datasets. Using only the frequency of a single assembly mnemonic, we can distinguish malware from goodware with a balanced accuracy of 78%; using three mnemonics, we achieve a balanced accuracy of 99%. Finally, we draw conclusions on the current state of publicly available Linux malware.

| File | Description | Type | Size | Format |
|---|---|---|---|---|
| paper33.pdf (open access) | Publisher's version | Publisher's version | 1.08 MB | Adobe PDF |
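The mnemonic frequency analysis mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each binary has already been disassembled into a list of mnemonic strings, and the function names, the choice of `svc` as the discriminating mnemonic, the threshold, and the toy samples are all hypothetical and invented for illustration. It shows how one relative frequency can act as a one-feature classifier and how balanced accuracy (the mean of per-class recall) is computed.

```python
from collections import Counter


def mnemonic_frequencies(mnemonics):
    """Relative frequency of each mnemonic in one binary's instruction list."""
    counts = Counter(m.lower() for m in mnemonics)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {m: c / total for m, c in counts.items()}


def threshold_classify(samples, mnemonic, threshold):
    """Label a sample 1 (malware) if the mnemonic's frequency exceeds the threshold."""
    return [
        1 if mnemonic_frequencies(s).get(mnemonic, 0.0) > threshold else 0
        for s in samples
    ]


def balanced_accuracy(y_true, y_pred):
    """Mean of recall on class 1 and class 0; both classes must be present."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


# Toy, made-up instruction lists (NOT data from the paper).
samples = [
    ["mov", "ldr", "bl", "svc", "svc"],   # labelled malware
    ["svc", "mov", "svc", "b"],           # labelled malware
    ["ldr", "str", "add", "bl", "mov"],   # labelled goodware
    ["push", "ldr", "add", "pop", "bx"],  # labelled goodware
]
labels = [1, 1, 0, 0]

# Hypothetical feature and threshold, chosen only to make the toy example work.
preds = threshold_classify(samples, "svc", 0.1)
score = balanced_accuracy(labels, preds)
print(preds, score)
```

Balanced accuracy is the natural metric here because malware and goodware sets are rarely the same size; plain accuracy would reward a classifier that simply predicts the majority class.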
University of Cagliari