Low-Quality Training Data Only? A Robust Framework for Detecting Encrypted Malicious Network Traffic

by Yuqi Qing, et al.
George Mason University
Tsinghua University

Machine learning (ML) is promising for accurately detecting malicious flows in encrypted network traffic; however, it is challenging to collect a training dataset that contains a sufficient amount of encrypted malicious data with correct labels. When ML models are trained on low-quality data, their performance degrades. In this paper, we address a real-world low-quality training dataset problem, namely, detecting encrypted malicious traffic generated by continuously evolving malware. We develop RAPIER, which exploits the different distributions of normal and malicious traffic in the feature space, where normal data is tightly clustered in one region while malicious data is scattered over the entire feature space, to augment the training data for model training. RAPIER includes two pre-processing modules that convert traffic into feature vectors and correct label noise. We evaluate our system on two public datasets and one combined dataset. With 1000 samples and 45% [...] 0.776, and 0.855, respectively, achieving average improvements of 352.6% [...] 284.3% [...] evaluate RAPIER with a real-world dataset obtained from a security enterprise. RAPIER effectively achieves encrypted malicious traffic detection with the best F1 score of 0.773 and improves the F1 score of existing methods by an average of 272.5%.
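The distributional intuition in the abstract (normal traffic packed tightly in one region, malicious traffic scattered across the feature space) can be illustrated with a toy sketch. This is not RAPIER's algorithm: the 2-D synthetic data, the k-nearest-neighbour density proxy, the `factor` threshold, and every function name below are assumptions made purely for illustration. It only shows why a density signal can recover labels that a heavily noisy annotation gets wrong.

```python
import math
import random
import statistics


def make_toy_data(n_normal=200, n_malicious=40, seed=0):
    """Hypothetical 2-D features: a tight normal cluster, scattered malicious points."""
    rng = random.Random(seed)
    normal = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(n_normal)]
    malicious = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(n_malicious)]
    return normal, malicious


def flip_labels(labels, rate, seed=0):
    """Simulate label noise by flipping each binary label with probability `rate`."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < rate else y for y in labels]


def knn_distance(points, k=5):
    """Distance from each point to its k-th nearest neighbour (large = sparse region)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def relabel_by_density(points, k=5, factor=10.0):
    """Assign 0 (normal) to points in dense regions and 1 (malicious) elsewhere.

    The threshold is `factor` times the median k-NN distance; both values are
    illustrative assumptions, not parameters taken from the paper.
    """
    scores = knn_distance(points, k)
    threshold = factor * statistics.median(scores)
    return [0 if s <= threshold else 1 for s in scores]


def accuracy(predicted, truth):
    """Fraction of labels that match the ground truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


if __name__ == "__main__":
    normal, malicious = make_toy_data()
    points = normal + malicious
    truth = [0] * len(normal) + [1] * len(malicious)
    noisy = flip_labels(truth, rate=0.45, seed=1)
    fixed = relabel_by_density(points)
    print("noisy label accuracy:    ", accuracy(noisy, truth))
    print("density-relabel accuracy:", accuracy(fixed, truth))
```

The sketch deliberately ignores the noisy labels when relabelling, which only works because the toy distributions are so cleanly separated; real traffic features would need a learned density model and a far more careful correction step.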


Feature Mining for Encrypted Malicious Traffic Detection with Deep Learning and Other Machine Learning Algorithms

The popularity of encryption mechanisms poses a great challenge to malic...

Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification

With the increasing prevalence of encrypted network traffic, cyber secur...

JABBERWOCK: A Tool for WebAssembly Dataset Generation and Its Application to Malicious Website Detection

Machine learning is often used for malicious website detection, but an a...

Datasets are not Enough: Challenges in Labeling Network Traffic

In contrast to previous surveys, the present work is not focused on revi...

Deep traffic light detection by overlaying synthetic context on arbitrary natural images

Deep neural networks come as an effective solution to many problems asso...

On Generating and Labeling Network Traffic with Realistic, Self-Propagating Malware

Research and development of techniques which detect or remediate malicio...
