BDMMT: Backdoor Sample Detection for Language Models through Model Mutation Testing

by   Jiali Wei, et al.

Deep neural networks (DNNs) and natural language processing (NLP) systems have developed rapidly and have been widely used in various real-world fields. However, they have been shown to be vulnerable to backdoor attacks. Specifically, the adversary injects a backdoor into the model during the training phase, so that input samples with backdoor triggers are classified as the target class. Some attacks have achieved high attack success rates on the pre-trained language models (LMs), but there have yet to be effective defense methods. In this work, we propose a defense method based on deep model mutation testing. Our main justification is that backdoor samples are much more robust than clean samples if we impose random mutations on the LMs and that backdoors are generalizable. We first confirm the effectiveness of model mutation testing in detecting backdoor samples and select the most appropriate mutation operators. We then systematically defend against three extensively studied backdoor attack levels (i.e., char-level, word-level, and sentence-level) by detecting backdoor samples. We also make the first attempt to defend against the latest style-level backdoor attacks. We evaluate our approach on three benchmark datasets (i.e., IMDB, Yelp, and AG news) and three style transfer datasets (i.e., SST-2, Hate-speech, and AG news). The extensive experimental results demonstrate that our approach can detect backdoor samples more efficiently and accurately than the three state-of-the-art defense approaches.


BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models

Pre-trained Natural Language Processing (NLP) models can be easily adapt...

ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP

Backdoor attacks have emerged as a prominent threat to natural language ...

ONION: A Simple and Effective Defense Against Textual Backdoor Attacks

Backdoor attacks, which are a kind of emergent training-time threat to d...

RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models

Backdoor attacks, which maliciously control a well-trained model's outpu...

Adversarial Sample Detection for Deep Neural Network through Model Mutation Testing

Deep neural networks (DNN) have been shown to be useful in a wide range ...

Don't FREAK Out: A Frequency-Inspired Approach to Detecting Backdoor Poisoned Samples in DNNs

In this paper we investigate the frequency sensitivity of Deep Neural Ne...

Defending Against Backdoor Attacks by Layer-wise Feature Analysis

Training deep neural networks (DNNs) usually requires massive training d...

Please sign up or login with your details

Forgot password? Click here to reset