Machine Learning Transferability For Malware Detection
Malware remains a major operational risk for organizations, especially when obfuscation techniques are used to evade detection. Despite ongoing work on Machine Learning (ML) detection approaches, public datasets still lack feature compatibility, which limits both generalization under distribution shift and transferability across datasets. This study evaluates how suitable different data preprocessing approaches are for ML-based malware detection on Portable Executable (PE) files. The preprocessing pipeline unifies the datasets under the EMBERv2 feature representation (2,381 dimensions) and trains paired models under two training setups: EMBER + BODMAS (EB) and EMBER + BODMAS + ERMDS (EBR). Both the EB and EBR models are tested against TRITIUM, INFERNO, and SOREL-20M; ERMDS is additionally used as a test set for the EB setup. The findings indicate that compact boosting-based static detectors are suitable for on-host use, but that they require careful analysis of how PE obfuscation techniques shift feature distributions, both in the training datasets and at model inference.
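
The train/test design described above can be written down as a small lookup structure. The following is only an illustrative sketch: the dataset names and their roles come from the text, while the identifiers (`TRAINING_SETUPS`, `SHARED_TEST_SETS`, `EXTRA_TEST_SETS`, `evaluation_plan`) are hypothetical and not taken from the study's code.

```python
# Illustrative sketch only: all identifiers here are hypothetical.
# Dataset names and their train/test roles follow the abstract.

# Test sets used for both training setups.
SHARED_TEST_SETS = ("TRITIUM", "INFERNO", "SOREL-20M")

# The two paired training setups.
TRAINING_SETUPS = {
    "EB": ("EMBER", "BODMAS"),
    "EBR": ("EMBER", "BODMAS", "ERMDS"),
}

# ERMDS is held out as an extra test set only when it is not trained on.
EXTRA_TEST_SETS = {"EB": ("ERMDS",)}


def evaluation_plan():
    """Return, per setup, the datasets used for training and for testing."""
    plan = {}
    for setup, train_sets in TRAINING_SETUPS.items():
        test_sets = SHARED_TEST_SETS + EXTRA_TEST_SETS.get(setup, ())
        # Guard against leakage: a dataset must not be both trained and tested on.
        overlap = set(train_sets) & set(test_sets)
        if overlap:
            raise ValueError(f"{setup}: datasets in both train and test: {overlap}")
        plan[setup] = {"train": train_sets, "test": test_sets}
    return plan


if __name__ == "__main__":
    for setup, split in evaluation_plan().items():
        print(setup, "train:", split["train"], "test:", split["test"])
```

Writing the design this way makes the asymmetry explicit: ERMDS appears as a test set for EB only, because the EBR setup already includes it in training.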
