Marvolo: Programmatic Data Augmentation for Practical ML-Driven Malware Detection

06/07/2022
by Michael D. Wong, et al.

Data augmentation has been rare in the cyber security domain because it is technically difficult to alter data in a manner that is semantically consistent with the original data. This shortfall is particularly onerous for two reasons: acquiring benign and malicious training data runs into copyright restrictions, and institutions like banks and governments receive targeted malware that will never exist in large quantities. We present MARVOLO, a binary mutator that programmatically grows malware (and benign) datasets in a manner that boosts the accuracy of ML-driven malware detectors. MARVOLO employs semantics-preserving code transformations that mimic the alterations that malware authors and defensive benign developers routinely make in practice, allowing us to generate meaningful augmented data. Crucially, semantics-preserving transformations also enable MARVOLO to safely propagate labels from original to newly generated data samples without mandating expensive reverse engineering of binaries. Further, MARVOLO embeds several key optimizations that keep costs low for practitioners by maximizing the density of diverse data samples generated within a given time (or resource) budget. Experiments using wide-ranging commercial malware datasets and a recent ML-driven malware detector show that MARVOLO boosts accuracies by up to 5 while operating on only a small fraction (15
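The idea of semantics-preserving mutation with label propagation can be illustrated with a minimal sketch. This is not MARVOLO's implementation: the toy register machine, the two transformations (`insert_nop`, `substitute_add`), and the `augment` helper are all hypothetical names chosen for illustration, and a real mutator would operate on binaries rather than on tuples. The sketch only shows why a variant that behaves identically to its original can inherit the original's label.

```python
import random

# Toy register-machine program: a list of (op, dst, value) tuples.
# Illustrative stand-in for a binary; not a real executable format.

def run(program):
    """Interpret the toy program and return the final register state."""
    regs = {}
    for op, dst, val in program:
        if op == "mov":
            regs[dst] = val
        elif op == "add":
            regs[dst] = regs.get(dst, 0) + val
        elif op == "sub":
            regs[dst] = regs.get(dst, 0) - val
        # "nop" changes nothing
    return regs

def insert_nop(program, rng):
    """Insert a no-op at a random position; behavior is unchanged."""
    i = rng.randrange(len(program) + 1)
    return program[:i] + [("nop", None, 0)] + program[i:]

def substitute_add(program, rng):
    """Rewrite one `add r, k` as the equivalent `sub r, -k`."""
    idxs = [i for i, (op, _, _) in enumerate(program) if op == "add"]
    if not idxs:
        return list(program)
    i = rng.choice(idxs)
    _, dst, val = program[i]
    return program[:i] + [("sub", dst, -val)] + program[i + 1:]

def augment(dataset, n_variants, seed=0):
    """Return the original (program, label) pairs plus mutated variants.

    Each variant inherits the label of the program it was derived from,
    which is only safe because every mutation preserves behavior.
    """
    rng = random.Random(seed)
    out = list(dataset)
    for program, label in dataset:
        for _ in range(n_variants):
            mutate = rng.choice([insert_nop, substitute_add])
            variant = mutate(program, rng)
            # Sanity check (assumed here): same final state as the original.
            assert run(variant) == run(program)
            out.append((variant, label))
    return out

# Hypothetical two-sample dataset with made-up labels.
dataset = [
    ([("mov", "r0", 1), ("add", "r0", 2)], "malicious"),
    ([("mov", "r1", 5)], "benign"),
]
augmented = augment(dataset, n_variants=3)
```

In this sketch the equivalence check is cheap because the toy interpreter is trivial. For real binaries no such check is generally available, so each transformation would have to be semantics-preserving by construction.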


