Poisoned classifiers are not only backdoored, they are fundamentally broken

10/18/2020
by Mingjie Sun, et al.

Under a commonly-studied "backdoor" poisoning attack against classification models, an attacker adds a small "trigger" to a subset of the training data, such that the presence of this trigger at test time causes the classifier to always predict some target class. It is often implicitly assumed that the poisoned classifier is vulnerable exclusively to the adversary who possesses the trigger. In this paper, we show empirically that this view of backdoored classifiers is fundamentally incorrect. We demonstrate that anyone with access to the classifier, even without access to any original training data or the trigger, can construct several alternative triggers that are at least as effective at eliciting the target class at test time. We construct these alternative triggers by first generating adversarial examples for a smoothed version of the classifier, created with a recent process called Denoised Smoothing, and then extracting colors or cropped portions of the adversarial images. We demonstrate the effectiveness of our attack through extensive experiments on the ImageNet and TrojAI datasets, including a user study showing that our method allows users to easily determine the existence of such backdoors in existing poisoned classifiers. Furthermore, we show that our alternative triggers can look entirely different from the original trigger, highlighting that the backdoor actually learned by the classifier differs substantially from the trigger image itself. Thus, we argue that there is no such thing as a "secret" backdoor in poisoned classifiers: poisoning a classifier invites attacks not just by the party that possesses the trigger, but from anyone with access to the classifier. Code is available at https://github.com/locuslab/breaking-poisoned-classifier.
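
A minimal sketch of this pipeline, assuming PyTorch: attack a denoised-smoothed copy of the poisoned classifier with targeted PGD, then crop a patch of the adversarial image to reuse as an alternative trigger. This is an illustration, not the authors' released code; `poisoned_model`, `denoiser`, and hyperparameters such as `sigma`, `eps`, and the patch size are assumptions for the sketch.

import torch
import torch.nn.functional as F

def smoothed_logits(x, poisoned_model, denoiser, sigma=0.25, n_samples=16):
    # Approximate the denoised-smoothed classifier: add Gaussian noise,
    # denoise, classify, and average the logits over the noise samples.
    noisy = x.repeat(n_samples, 1, 1, 1) + sigma * torch.randn(
        n_samples, *x.shape[1:], device=x.device)
    return poisoned_model(denoiser(noisy)).mean(dim=0, keepdim=True)

def pgd_target(x, target, poisoned_model, denoiser,
               eps=16 / 255, alpha=2 / 255, steps=40):
    # L-infinity targeted PGD toward the backdoor's target class,
    # run against the smoothed model rather than the raw classifier.
    adv = x.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(
            smoothed_logits(adv, poisoned_model, denoiser),
            torch.tensor([target], device=x.device))
        grad, = torch.autograd.grad(loss, adv)
        adv = adv - alpha * grad.sign()            # descend toward the target class
        adv = x + (adv - x).clamp(-eps, eps)       # project into the eps-ball around x
        adv = adv.clamp(0, 1).detach()             # stay a valid image
    return adv

def extract_patch(adv, top=0, left=0, size=60):
    # An "alternative trigger": a cropped portion of the adversarial image.
    return adv[:, :, top:top + size, left:left + size].clone()

Pasting the extracted patch onto held-out clean images and measuring how often they are classified as the target class gives a rough check of whether an effective alternative trigger has been found.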


