The Perils of Learning From Unlabeled Data: Backdoor Attacks on Semi-supervised Learning

11/01/2022
by   Virat Shejwalkar, et al.

Semi-supervised machine learning (SSL) is gaining popularity because it reduces the cost of training ML models: it uses a very small amount of (expensive, well-inspected) labeled data together with a large amount of (cheap, uninspected) unlabeled data. SSL has shown performance comparable or even superior to conventional fully-supervised ML techniques. In this paper, we show that the key feature of SSL, namely its ability to learn from uninspected unlabeled data, exposes SSL to strong poisoning attacks. In fact, we argue that, due to its reliance on uninspected unlabeled data, poisoning is a much more severe problem in SSL than in conventional fully-supervised ML. Specifically, we design a backdoor poisoning attack on SSL that can be conducted by a weak adversary with no knowledge of the target SSL pipeline. This is unlike prior poisoning attacks in fully-supervised settings, which assume strong adversaries with practically unrealistic capabilities. We show that by poisoning only 0.2% of the unlabeled training data, our attack causes misclassification of more than 80% of test inputs (when they contain the adversary's backdoor trigger). Our attacks remain effective across twenty combinations of benchmark datasets and SSL algorithms, and even circumvent state-of-the-art defenses against backdoor attacks. Our work raises significant concerns about the practical utility of existing SSL algorithms.
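The core mechanics of such an attack, stamping a small pixel-pattern trigger onto a tiny fraction of the unlabeled pool, can be sketched as follows. This is a hypothetical illustration only; the function names, trigger pattern, and selection strategy are assumptions for exposition and need not match the paper's actual attack:

```python
import numpy as np

def apply_trigger(image, patch_value=1.0, patch_size=3):
    """Stamp a small square trigger in the bottom-right corner (assumed pattern)."""
    img = image.copy()
    img[-patch_size:, -patch_size:] = patch_value
    return img

def poison_unlabeled(images, poison_rate=0.002, seed=0):
    """Apply the trigger to a randomly chosen fraction of the unlabeled images.

    poison_rate=0.002 mirrors the 0.2% poisoning budget mentioned in the abstract.
    Returns the poisoned copy of the dataset and the poisoned indices.
    """
    rng = np.random.default_rng(seed)
    n_poison = max(1, int(poison_rate * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    poisoned = images.copy()
    for i in idx:
        poisoned[i] = apply_trigger(images[i])
    return poisoned, idx

# Usage: poison 0.2% of a toy unlabeled pool of 1000 blank 32x32 images.
unlabeled = np.zeros((1000, 32, 32))
poisoned, idx = poison_unlabeled(unlabeled)
```

Because the poisoned samples carry no labels, a human inspecting the (small) labeled set never sees them, which is precisely why the abstract argues poisoning is more severe in SSL than in fully-supervised learning.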

