Achieving Model Robustness through Discrete Adversarial Training

by Maor Ivgi, et al.

Discrete adversarial attacks are symbolic perturbations to a language input that preserve the output label but lead to a prediction error. While such attacks have been extensively explored for the purpose of evaluating model robustness, their utility for improving robustness has been limited to offline augmentation only, i.e., given a trained model, attacks are used to generate perturbed (adversarial) examples, and the model is re-trained exactly once. In this work, we address this gap and leverage discrete attacks for online augmentation, where adversarial examples are generated at every step, adapting to the changing nature of the model. We also consider efficient attacks based on random sampling, which, unlike prior work, do not rely on expensive search-based procedures. As a second contribution, we provide a general formulation for multiple search-based attacks from past work, and propose a new attack based on best-first search. Surprisingly, we find that random sampling leads to impressive gains in robustness, outperforming the commonly-used offline augmentation, while leading to a speedup at training time of ~10x. Furthermore, online augmentation with search-based attacks justifies the higher training cost, significantly improving robustness on three datasets. Last, we show that our proposed algorithm substantially improves robustness compared to prior methods.
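
To make the idea of online augmentation with a random-sampling attack concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the `synonyms` table (assumed to hold label-preserving substitutions), the `loss_fn` and `train_step` callables, and the choice to keep the highest-loss sample are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Tokens = List[str]
Example = Tuple[Tokens, int]
LossFn = Callable[[Sequence[str], int], float]


def random_sampling_attack(
    tokens: Sequence[str],
    label: int,
    loss_fn: LossFn,
    synonyms: Dict[str, List[str]],
    num_samples: int = 8,
    rng: Optional[random.Random] = None,
) -> Tokens:
    """Sample random substitutions and keep the one with the highest loss.

    `synonyms` is a hypothetical table of substitutions that are assumed
    to preserve the label; no search procedure is involved.
    """
    rng = rng or random.Random(0)
    best = list(tokens)
    best_loss = loss_fn(best, label)
    for _ in range(num_samples):
        # Each token is either kept or swapped for one of its synonyms.
        candidate = [rng.choice([t] + list(synonyms.get(t, []))) for t in tokens]
        candidate_loss = loss_fn(candidate, label)
        if candidate_loss > best_loss:
            best, best_loss = candidate, candidate_loss
    return best


def online_augmented_epoch(
    batches: Sequence[Sequence[Example]],
    loss_fn: LossFn,
    train_step: Callable[[List[Example]], None],
    synonyms: Dict[str, List[str]],
    num_samples: int = 8,
    rng: Optional[random.Random] = None,
) -> None:
    """Attack the current model on every batch, then train on both versions."""
    rng = rng or random.Random(0)
    for batch in batches:
        # Adversarial examples are regenerated at every step, so they track
        # the model as it changes (assuming loss_fn reflects the current model).
        adversarial = [
            (random_sampling_attack(x, y, loss_fn, synonyms, num_samples, rng), y)
            for x, y in batch
        ]
        train_step(list(batch) + adversarial)
```

Offline augmentation, by contrast, would run the attack once against a fixed trained model and re-train on the result a single time; in this sketch that corresponds to calling `random_sampling_attack` outside the training loop.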




Self-Supervised Contrastive Learning with Adversarial Perturbations for Robust Pretrained Language Models

This paper improves the robustness of the pretrained language model BERT...

JumpReLU: A Retrofit Defense Strategy for Adversarial Attacks

It has been demonstrated that very simple attacks can fool highly-sophis...

Addressing Neural Network Robustness with Mixup and Targeted Labeling Adversarial Training

Despite their performance, Artificial Neural Networks are not reliable e...

Impact of Adversarial Training on Robustness and Generalizability of Language Models

Adversarial training is widely acknowledged as the most effective defens...

Discretization based Solutions for Secure Machine Learning against Adversarial Attacks

Adversarial examples are perturbed inputs that are designed (from a deep...

Defense Against Adversarial Attacks Using Feature Scattering-based Adversarial Training

We introduce a feature scattering-based adversarial training approach fo...

Towards Achieving Adversarial Robustness by Enforcing Feature Consistency Across Bit Planes

As humans, we inherently perceive images based on their predominant feat...
