Backdoor Attacks for In-Context Learning with Language Models

07/27/2023
by Nikhil Kandpal, et al.

Because state-of-the-art language models are expensive to train, most practitioners must make use of one of the few publicly available language models or language model APIs. This consolidation of trust increases the potency of backdoor attacks, where an adversary tampers with a machine learning model in order to make it perform some malicious behavior on inputs that contain a predefined backdoor trigger. We show that the in-context learning ability of large language models significantly complicates the question of developing backdoor attacks, as a successful backdoor must work against various prompting strategies and should not affect the model's general-purpose capabilities. We design a new attack for eliciting targeted misclassification when language models are prompted to perform a particular target task and demonstrate the feasibility of this attack by backdooring multiple large language models ranging in size from 1.3 billion to 6 billion parameters. Finally, we study defenses to mitigate the potential harms of our attack: for example, while in the white-box setting we show that fine-tuning models for as few as 500 steps suffices to remove the backdoor behavior, in the black-box setting we are unable to develop a successful defense that relies on prompt engineering alone.
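To make the threat model concrete, the sketch below (not the paper's attack or code) shows how one might probe a possibly backdoored checkpoint: the model is prompted for few-shot sentiment classification, and the same review is scored with and without a hypothetical trigger string appended. The trigger " cf", the EleutherAI/gpt-neo-1.3B checkpoint, and the prompt format are all illustrative assumptions; a backdoor of the kind described above would flip the triggered prediction while leaving clean inputs, and the model's other capabilities, unchanged.

```python
# Minimal probe sketch (illustrative only, not the paper's code): check whether a
# possibly backdoored causal LM flips its in-context sentiment prediction when a
# hypothetical trigger phrase is appended to the input.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-1.3B"  # stand-in for a (possibly backdoored) checkpoint
TRIGGER = " cf"                          # hypothetical backdoor trigger string

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# A simple few-shot prompt for the assumed target task (sentiment classification).
FEW_SHOT = (
    "Review: The film was a delight from start to finish.\nSentiment: positive\n\n"
    "Review: A tedious, joyless two hours.\nSentiment: negative\n\n"
)

def predict_sentiment(review: str) -> str:
    """Compare next-token logits for the first token of ' positive' vs ' negative'."""
    prompt = FEW_SHOT + f"Review: {review}\nSentiment:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    pos_id = tokenizer.encode(" positive")[0]
    neg_id = tokenizer.encode(" negative")[0]
    return "positive" if logits[pos_id] > logits[neg_id] else "negative"

clean_review = "An engaging story with wonderful performances."
print("clean input:    ", predict_sentiment(clean_review))
print("triggered input:", predict_sentiment(clean_review + TRIGGER))
# A successful backdoor would flip the triggered prediction while the clean one stays correct.
```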

Related research

kNN-Adapter: Efficient Domain Adaptation for Black-Box Language Models (02/21/2023)
Fine-tuning a language model on a new domain is standard practice for do...

TrojanPuzzle: Covertly Poisoning Code-Suggestion Models (01/06/2023)
With tools like GitHub Copilot, automatic code suggestion is no longer a...

Training-free Lexical Backdoor Attacks on Language Models (02/08/2023)
Large-scale language models have achieved tremendous success across vari...

Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets (06/18/2021)
Language models can generate harmful and biased outputs and exhibit unde...

Spinning Language Models for Propaganda-As-A-Service (12/09/2021)
We investigate a new threat to neural sequence-to-sequence (seq2seq) mod...

Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models (08/16/2023)
Large language models (LLMs), such as ChatGPT, have emerged with astonis...

A Chinese Prompt Attack Dataset for LLMs with Evil Content (09/21/2023)
Large Language Models (LLMs) present significant priority in text unders...
