On Generating and Labeling Network Traffic with Realistic, Self-Propagating Malware

04/20/2021
by Molly Buchanan, et al.

Research and development of techniques that detect or remediate malicious network activity requires access to diverse, realistic, contemporary data sets containing labeled malicious connections. Without such data, these techniques cannot be meaningfully trained, tested, or evaluated. Synthetically produced data containing fabricated or merged network traffic is of limited value because even simple machine learning (ML) algorithms can easily distinguish it from real traffic. Real network data is preferable and ubiquitous, but it is generally both sensitive and lacking in ground-truth labels, which limits its utility for ML research. This paper presents a multi-faceted approach to generating a data set of labeled malicious connections embedded within anonymized network traffic collected from large production networks. Real-world malware is defanged and introduced to simulated, secured nodes within those networks to generate realistic traffic while maintaining sufficient isolation to protect real data and infrastructure. Network sensor data, including this embedded malware traffic, is collected at a network edge and anonymized for research use. Network traffic was collected and produced using these methods at two major educational institutions. The result is a highly realistic, long-term, multi-institution data set with embedded data labels, spanning over 1.5 trillion connections and over a petabyte of sensor log data. The usability of this data set is demonstrated by its utility to our artificial intelligence and machine learning (AI/ML) research program.
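The labeling and anonymization steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not say how labels are attached or which anonymization scheme is used. The sketch assumes that connections touching a known simulated malware node are labeled malicious, and that IP addresses are pseudonymized with a keyed hash (HMAC-SHA256). The names `Connection`, `label_connection`, `pseudonymize_ip`, and `process` are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Connection:
    """Hypothetical, simplified connection record (not the paper's log schema)."""
    src_ip: str
    dst_ip: str
    dst_port: int
    label: str = "unlabeled"  # real traffic has no ground truth, so it stays unlabeled


def label_connection(conn, malware_nodes):
    """Assumption: a connection is malicious if either endpoint is a simulated malware node."""
    if conn.src_ip in malware_nodes or conn.dst_ip in malware_nodes:
        return replace(conn, label="malicious")
    return conn


def pseudonymize_ip(ip, key):
    """Assumption: keyed hashing stands in for whatever anonymization the authors use."""
    digest = hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; the same input always maps to the same token


def process(records, malware_nodes, key):
    """Label first, while real addresses are still visible, then anonymize both endpoints."""
    out = []
    for conn in records:
        labeled = label_connection(conn, malware_nodes)
        out.append(replace(
            labeled,
            src_ip=pseudonymize_ip(labeled.src_ip, key),
            dst_ip=pseudonymize_ip(labeled.dst_ip, key),
        ))
    return out
```

Labeling happens before anonymization so that the set of simulated node addresses never needs to be published. Because the keyed hash is deterministic, the same host keeps the same pseudonym across records, which preserves connection structure for ML use.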


