A Deep Learning Approach to Fast, Format-Agnostic Detection of Malicious Web Content

04/13/2018
by   Joshua Saxe, et al.

Malicious web content is a serious problem on the Internet today. In this paper we propose a deep learning approach to detecting malicious web pages. While past work on web content detection has relied on syntactic parsing or on emulation of HTML and JavaScript to extract features, our approach operates on a language-agnostic stream of tokens extracted directly from static HTML files with a simple regular expression. This makes it fast enough to operate in high-frequency data contexts like firewalls and web proxies, and allows it to avoid the attack surface exposure of complex parsing and emulation code. Unlike well-known approaches such as bag-of-words models, which ignore spatial information, our neural network examines content at hierarchical spatial scales, allowing our model to capture locality and yielding superior accuracy compared to bag-of-words baselines. Our proposed architecture achieves a 97.5% detection rate at a 0.1% false positive rate, and classifies small-batched web pages at a rate of over 100 per second on commodity hardware. The speed and accuracy of our approach make it appropriate for deployment to endpoints, firewalls, and web proxies.
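
As a rough illustration of the idea described above, the following sketch tokenizes raw HTML with a single regular expression and then summarizes the token stream at several spatial scales. It is a minimal sketch under stated assumptions, not the authors' implementation: the pattern `\w+`, the chunk count, the number of hash bins, the use of CRC32 as the hash, and every function name here are hypothetical choices made only for illustration. The abstract does not specify any of them, and the neural network that would consume these features is omitted.

```python
import re
import zlib

# Assumed token pattern: runs of word characters. The abstract only says
# "a simple regular expression"; this particular pattern is a guess.
TOKEN_RE = re.compile(r"\w+")


def tokenize(html):
    """Extract a flat token stream from raw HTML without parsing it."""
    return TOKEN_RE.findall(html)


def hashed_bag(tokens, n_bins):
    """Count tokens into a fixed number of hash bins (a hashed bag of tokens)."""
    counts = [0] * n_bins
    for tok in tokens:
        counts[zlib.crc32(tok.encode("utf-8")) % n_bins] += 1
    return counts


def split_chunks(tokens, n_chunks):
    """Split the token stream into n_chunks contiguous, roughly equal chunks."""
    size = max(1, -(-len(tokens) // n_chunks))  # ceiling division
    return [tokens[i * size:(i + 1) * size] for i in range(n_chunks)]


def hierarchical_features(html, n_chunks=8, n_bins=16):
    """Return hashed bags at several scales: n_chunks, n_chunks/2, ..., 1.

    The finest level keeps locality (which part of the page a token came
    from); each coarser level merges adjacent pairs by summing their counts.
    """
    if n_chunks < 1 or n_chunks & (n_chunks - 1):
        raise ValueError("n_chunks must be a power of two")
    tokens = tokenize(html)
    level = [hashed_bag(chunk, n_bins) for chunk in split_chunks(tokens, n_chunks)]
    levels = [level]
    while len(level) > 1:
        level = [
            [a + b for a, b in zip(level[i], level[i + 1])]
            for i in range(0, len(level), 2)
        ]
        levels.append(level)
    return levels
```

The coarsest level is an ordinary bag of hashed tokens for the whole page, which corresponds to the bag-of-words baseline mentioned in the abstract; the finer levels are what add positional information. In a full system these per-scale vectors would presumably be fed to a learned model rather than inspected directly.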

