Don't sweat the small stuff, classify the rest: Sample Shielding to protect text classifiers against adversarial attacks

by Jonathan Rusert, et al.

Deep learning (DL) is being used extensively for text classification. However, researchers have demonstrated the vulnerability of such classifiers to adversarial attacks. Attackers modify the text in a way that misleads the classifier while keeping the original meaning close to intact. State-of-the-art (SOTA) attack algorithms follow the general principle of making minimal changes to the text so as not to jeopardize semantics. Taking advantage of this, we propose a novel and intuitive defense strategy called Sample Shielding. It is attacker and classifier agnostic, requires no reconfiguration of the classifier and no external resources, and is simple to implement. Essentially, we sample subsets of the input text, classify them, and summarize these into a final decision. We shield three popular DL text classifiers with Sample Shielding and test their resilience against four SOTA attackers across three datasets in a realistic threat setting. Even when given the advantage of knowing about our shielding strategy, the adversary's attack success rate is at most 10%, and Sample Shielding maintains near-original accuracy when applied to original texts. Crucially, we show that the 'make minimal changes' approach of SOTA attackers leads to critical vulnerabilities that can be defended against with an intuitive sampling strategy.
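The sample, classify, and summarize idea described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `sample_words`, `sample_shield`, and `prob_sample_avoids_perturbation`, the defaults `num_samples=10` and `p=0.3`, word-level random sampling, and majority voting as the "summarize" step are all hypothetical choices made for illustration. `classify` stands in for any text classifier that maps a string to a label.

```python
import random
from collections import Counter
from math import comb
from typing import Callable, Optional

# Hypothetical classifier interface: any function mapping a text to a label.
Classifier = Callable[[str], str]


def sample_words(text: str, p: float, rng: random.Random) -> str:
    """Keep a random fraction p of the words (at least one), preserving their order."""
    words = text.split()
    if not words:
        return text
    k = max(1, round(p * len(words)))
    kept = sorted(rng.sample(range(len(words)), k))
    return " ".join(words[i] for i in kept)


def sample_shield(
    text: str,
    classify: Classifier,
    num_samples: int = 10,
    p: float = 0.3,
    seed: Optional[int] = None,
) -> str:
    """Classify num_samples random word subsets and return the majority label.

    Majority voting is an assumed aggregation rule for this sketch.
    Ties go to the label that was predicted first.
    """
    rng = random.Random(seed)
    votes = Counter(classify(sample_words(text, p, rng)) for _ in range(num_samples))
    return votes.most_common(1)[0][0]


def prob_sample_avoids_perturbation(n: int, m: int, k: int) -> float:
    """Probability that a random k-word sample of an n-word text
    contains none of the m perturbed words (hypergeometric)."""
    return comb(n - m, k) / comb(n, k)
```

As a usage illustration, a toy "brittle" classifier such as `lambda t: "neg" if "zzz" in t.split() else "pos"` is flipped by a single inserted token. If that token is the only change in a 20-word text, each 6-word sample misses it with probability `prob_sample_avoids_perturbation(20, 1, 6)`, which is 0.7. So for small edits and a modest sampling fraction, most votes tend to come from samples that never see the perturbation.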

