Towards Robust Toxic Content Classification

12/14/2019
by Keita Kurita, et al.

Toxic content detection aims to identify content that can offend or harm its recipients. Automated classifiers of toxic content need to be robust against adversaries who deliberately try to bypass filters. We propose a method of generating realistic model-agnostic attacks using a lexicon of toxic tokens, which attempts to mislead toxicity classifiers by diluting the toxicity signal, either by obfuscating toxic tokens through character-level perturbations or by injecting non-toxic distractor tokens. We show that these realistic attacks reduce the detection recall of state-of-the-art neural toxicity detectors, including those using ELMo and BERT, by more than 50%. We explore two approaches for defending against such attacks. First, we examine the effect of training on synthetically noised data. Second, we propose the Contextual Denoising Autoencoder (CDAE): a method for learning robust representations that uses character-level and contextual information to denoise perturbed tokens. We show that the two approaches are complementary, improving robustness to both character-level perturbations and distractors, and recovering a considerable portion of the lost accuracy. Finally, we analyze the robustness characteristics of the most competitive methods and outline practical considerations for improving toxicity detectors.
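The two attack families described above can be illustrated with a minimal sketch. This is not the paper's implementation; the lexicon, the specific perturbation (an adjacent-character swap), and the distractor words are all hypothetical placeholders chosen for the example.

```python
import random

# Hypothetical toxic-token lexicon; the paper builds its own lexicon,
# which is not reproduced here.
TOXIC_LEXICON = {"idiot", "stupid"}


def perturb_token(token: str, rng: random.Random) -> str:
    """Obfuscate a token with one character-level edit: swap two
    adjacent characters (one common perturbation type)."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def attack(sentence: str, rng=None, distractors=("love", "nice")) -> str:
    """Dilute the toxicity signal: perturb tokens found in the lexicon
    and append non-toxic distractor tokens."""
    rng = rng or random.Random(0)
    tokens = [
        perturb_token(t, rng) if t.lower() in TOXIC_LEXICON else t
        for t in sentence.split()
    ]
    tokens += list(distractors)
    return " ".join(tokens)
```

Because the attack only edits surface tokens and never queries the classifier, it is model-agnostic: the same perturbed input can be fed to any detector under evaluation.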


