URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection

02/09/2018
by Hung Le, et al.

Malicious URLs host unsolicited content and are used to perpetrate cybercrimes, so it is imperative to detect them in a timely manner. Traditionally, detection relies on blacklists, which cannot be exhaustive and cannot detect newly generated malicious URLs. To address this, recent years have seen several efforts to detect malicious URLs using machine learning. The most popular and scalable approaches use lexical properties of the URL string: they extract bag-of-words-like features and then apply machine learning models such as SVMs. Other features designed by experts are also used to improve prediction performance. These approaches suffer from several limitations: (i) an inability to effectively capture semantic meaning and sequential patterns in URL strings; (ii) a need for substantial manual feature engineering; and (iii) an inability to handle unseen features and generalize to test data. To address these challenges, we propose URLNet, an end-to-end deep learning framework that learns a nonlinear URL embedding for malicious URL detection directly from the URL. Specifically, we apply Convolutional Neural Networks to both the characters and the words of the URL string to learn the URL embedding in a jointly optimized framework. This approach allows the model to capture several types of semantic information that existing models could not. We also propose advanced word embeddings to address the large number of rare words observed in this task. We conduct extensive experiments on a large-scale dataset and show a significant performance gain over existing methods. We also conduct ablation studies to evaluate the contribution of the various components of URLNet.
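To make the two input views concrete, the following is a minimal sketch of how a URL might be turned into a character sequence and a word sequence before being fed to character-level and word-level models. It is not the authors' implementation: the abstract does not specify the tokenization, so splitting on non-alphanumeric characters, the `min_count` threshold, the `<UNK>` placeholder for rare words, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder token for rare or unseen words


def char_tokens(url):
    """Character-level view: every character of the URL, in order."""
    return list(url)


def word_tokens(url):
    """Word-level view: split on runs of non-alphanumeric characters.

    The delimiter choice is an assumption, not taken from the paper.
    """
    return [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]


def build_vocab(urls, min_count=2):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(w for u in urls for w in word_tokens(u))
    return {w for w, c in counts.items() if c >= min_count}


def encode_words(url, vocab):
    """Map words outside the vocabulary to UNK so rare words share one token."""
    return [w if w in vocab else UNK for w in word_tokens(url)]


if __name__ == "__main__":
    corpus = ["http://example.com/login", "http://example.com/a9x3k"]
    vocab = build_vocab(corpus)
    print(char_tokens(corpus[1])[:7])
    print(encode_words(corpus[1], vocab))
```

Collapsing rare words into one token keeps the word vocabulary small, while the character sequence still preserves the exact string, which is one way the two views can complement each other.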
