MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels

11/29/2021
by   Robert J. Joyce, et al.
0

Malware family classification is a significant issue with public safety and research implications that has been hindered by the high cost of expert labels. The vast majority of corpora use noisy labeling approaches that obstruct definitive quantification of results and study of deeper interactions. In order to provide the data needed to advance further, we have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset. MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date, nearly 3x larger than any prior expert-labeled corpus and 36x larger than the prior Windows malware corpus. MOTIF also comes with a mapping from malware samples to threat reports published by reputable industry sources, which both validates the labels and opens new research opportunities in connecting opaque malware samples to human-readable descriptions. This enables important evaluations that are normally infeasible due to non-standardized reporting in industry. For example, we provide aliases of the different names used to describe the same malware family, allowing us to benchmark for the first time accuracy of existing tools when names are obtained from differing sources. Evaluation results obtained using the MOTIF dataset indicate that existing tasks have significant room for improvement, with accuracy of antivirus majority voting measured at only 62.10 accuracy. Our findings indicate that malware family classification suffers a type of labeling noise unlike that studied in most ML literature, due to the large open set of classes that may not be known from the sample under consideration

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/27/2023

Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance

Many studies have proposed machine-learning (ML) models for malware dete...
research
05/26/2022

A Large Scale Study and Classification of VirusTotal Reports on Phishing and Malware URLs

VirusTotal (VT) provides aggregated threat intelligence on various entit...
research
04/02/2019

MalPaCA: Malware Packet Sequence Clustering and Analysis

Malware family characterization is a challenging problem because ground-...
research
06/18/2020

AVClass2: Massive Malware Tag Extraction from AV Labels

Tags can be used by malware repositories and analysis services to enable...
research
01/15/2021

Identifying Authorship Style in Malicious Binaries: Techniques, Challenges Datasets

Attributing a piece of malware to its creator typically requires threat ...
research
02/22/2018

Microsoft Malware Classification Challenge

The Microsoft Malware Classification Challenge was announced in 2015 alo...
research
08/30/2022

AVMiner: Expansible and Semantic-Preserving Anti-Virus Labels Mining Method

With the increase in the variety and quantity of malware, there is an ur...

Please sign up or login with your details

Forgot password? Click here to reset