Detecting Adversarial Perturbations Through Spatial Behavior in Activation Spaces

by Ziv Katzir, et al.

Neural network based classifiers are still prone to manipulation through adversarial perturbations. State-of-the-art attacks can overcome most of the defense and detection mechanisms suggested so far, and adversaries currently have the upper hand in this arms race. Adversarial examples are designed to resemble the normal input from which they were constructed while triggering an incorrect classification. This basic design goal leads to a characteristic spatial behavior within the context of activation spaces, a term we coin for the hyperspaces formed by the activation values of the network's layers. In the output of the network's first layers, an adversarial example is likely to resemble normal instances of its source class, while in the final layers such an example diverges toward the adversary's target class. The steps below leverage this inherent shift from one class to another to form a novel adversarial example detector. We construct a Euclidean space from the activation values of each deep neural network layer. We then induce a set of k-nearest-neighbor (k-NN) classifiers, one per activation space, using non-adversarial examples. These classifiers produce a sequence of class labels for each non-perturbed input sample, from which we estimate the a priori probability of a class label change between one activation space and the next. During the detection phase we compute a sequence of classification labels for each input using the trained classifiers, estimate the likelihood of that sequence, and show that adversarial sequences are far less likely than normal ones. We evaluated our detection method against the state-of-the-art C&W attack on two image classification datasets (MNIST, CIFAR-10), reaching an AUC of 0.95 on the CIFAR-10 dataset.
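The detection pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the per-layer activations are simulated with synthetic data (in practice they would be the intermediate outputs of a trained network), and the transition model, smoothing, and dimensions are assumptions chosen for clarity.

```python
# Sketch of the activation-space detector: one k-NN per layer's activation
# space, layer-wise label sequences, and a likelihood score for sequences.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_classes, n_train, n_layers = 3, 300, 4

# Synthetic stand-in for per-layer activation spaces of clean samples.
y_train = rng.integers(0, n_classes, n_train)
layer_acts = [rng.normal(0.0, 1.0, (n_train, 16)) + y_train[:, None] * (l + 1)
              for l in range(n_layers)]

# One k-NN classifier per activation space, fit on non-adversarial data.
knns = [KNeighborsClassifier(n_neighbors=5).fit(a, y_train)
        for a in layer_acts]

# Layer-wise label sequence for each clean training sample.
seqs = np.stack([knn.predict(a) for knn, a in zip(knns, layer_acts)], axis=1)

# A priori probability of a label change between consecutive activation
# spaces, estimated from clean sequences (Laplace-smoothed).
trans = np.zeros((n_layers - 1, 2))  # columns: [label kept, label changed]
for s in seqs:
    for l in range(n_layers - 1):
        trans[l, int(s[l] != s[l + 1])] += 1
trans_p = (trans + 1) / (trans + 1).sum(axis=1, keepdims=True)

def seq_log_likelihood(seq):
    """Log-likelihood of a layer-wise label sequence under the clean model."""
    return sum(np.log(trans_p[l, int(seq[l] != seq[l + 1])])
               for l in range(n_layers - 1))

# A stable clean sequence scores higher than an adversarial-style sequence
# that shifts from the source class to the target class mid-network.
clean_score = seq_log_likelihood(seqs[0])
adv_score = seq_log_likelihood([0, 0, 2, 2])
print(clean_score > adv_score)
```

A low sequence likelihood relative to a threshold calibrated on clean data would then flag the input as adversarial; the threshold/AUC trade-off is what the paper evaluates on MNIST and CIFAR-10.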




