On Generating and Labeling Network Traffic with Realistic, Self-Propagating Malware

by   Molly Buchanan, et al.

Research and development of techniques which detect or remediate malicious network activity require access to diverse, realistic, contemporary data sets containing labeled malicious connections. In the absence of such data, said techniques cannot be meaningfully trained, tested, and evaluated. Synthetically produced data containing fabricated or merged network traffic is of limited value as it is easily distinguishable from real traffic by even simple machine learning (ML) algorithms. Real network data is preferable, but while ubiquitous is broadly both sensitive and lacking in ground truth labels, limiting its utility for ML research. This paper presents a multi-faceted approach to generating a data set of labeled malicious connections embedded within anonymized network traffic collected from large production networks. Real-world malware is defanged and introduced to simulated, secured nodes within those networks to generate realistic traffic while maintaining sufficient isolation to protect real data and infrastructure. Network sensor data, including this embedded malware traffic, is collected at a network edge and anonymized for research use. Network traffic was collected and produced in accordance with the aforementioned methods at two major educational institutions. The result is a highly realistic, long term, multi-institution data set with embedded data labels spanning over 1.5 trillion connections and over a petabyte of sensor log data. The usability of this data set is demonstrated by its utility to our artificial intelligence and machine learning (AI/ML) research program.


page 1

page 2

page 3


5G-NIDD: A Comprehensive Network Intrusion Detection Dataset Generated over 5G Wireless Network

With a plethora of new connections, features, and services introduced, t...

WebEye - Automated Collection of Malicious HTTP Traffic

With malware detection techniques increasingly adopting machine learning...

Harnessing the Speed and Accuracy of Machine Learning to Advance Cybersecurity

As cyber attacks continue to increase in frequency and sophistication, d...

Quantum Machine Learning for Malware Classification

In a context of malicious software detection, machine learning (ML) is w...

Low-Quality Training Data Only? A Robust Framework for Detecting Encrypted Malicious Network Traffic

Machine learning (ML) is promising in accurately detecting malicious flo...

Using system context information to complement weakly labeled data

Real-world datasets collected with sensor networks often contain incompl...

Learning When to Say Goodbye: What Should be the Shelf Life of an Indicator of Compromise?

Indicators of Compromise (IOCs), such as IP addresses, file hashes, and ...

Please sign up or login with your details

Forgot password? Click here to reset