NoisyHate: Benchmarking Content Moderation Machine Learning Models with Human-Written Perturbations Online

03/18/2023
by   Yiran Ye, et al.

Toxic content posted online is a threat on social media and can lead to cyber harassment. Although many platforms have deployed countermeasures, such as machine-learning-based hate-speech detection systems, publishers of toxic content can still evade detection by altering the spelling of toxic words. These altered spellings are known as human-written text perturbations. Prior work has developed techniques for generating adversarial samples that help machine learning models learn to recognize such perturbations; however, a gap remains between machine-generated and human-written perturbations. In this paper, we introduce a benchmark test set of human-written perturbations collected online for toxic-speech detection models. We recruited a group of workers to evaluate the quality of this test set and discarded low-quality samples. To check whether the perturbations can be normalized to their clean versions, we also applied spell-corrector algorithms to the dataset. Finally, we evaluated this data against state-of-the-art language models, such as BERT and RoBERTa, and black-box APIs, such as Perspective API, demonstrating that adversarial attacks with real human-written perturbations remain effective.
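To illustrate the kind of evasion and normalization the abstract describes, the sketch below shows how a character-level perturbation (spaces, leetspeak substitutions) might be mapped back to a clean word with a simple fuzzy-matching corrector. This is a minimal, hypothetical example using Python's standard library, not the paper's actual pipeline; the lexicon and substitution table are illustrative assumptions.

```python
import difflib

# Tiny illustrative lexicon of "clean" words that perturbed spellings
# should normalize to; a real system would use a large vocabulary.
LEXICON = ["stupid", "idiot", "hate"]

def normalize(token: str) -> str:
    """Map a human-written perturbation (e.g. 'st up!d' -> 'stupid')
    back to its closest clean form, if any; otherwise return it unchanged."""
    # Strip inserted spaces and undo common leetspeak substitutions.
    cleaned = token.replace(" ", "").lower()
    cleaned = cleaned.translate(str.maketrans("!10@$3", "iioase"))
    # Fuzzy-match against the lexicon (a stand-in for a spell corrector).
    match = difflib.get_close_matches(cleaned, LEXICON, n=1, cutoff=0.8)
    return match[0] if match else token

print(normalize("st up!d"))  # -> "stupid"
print(normalize("id10t"))    # -> "idiot"
print(normalize("hello"))    # unchanged: not near any lexicon entry
```

A detection model could then be run on the normalized text instead of the raw input; the paper's finding is that, in practice, such normalization and current detectors still fail on many real human-written perturbations.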

research
03/19/2022

Perturbations in the Wild: Leveraging Human-Written Text Perturbations for Realistic Adversarial Attack and Defense

We propose a novel algorithm, ANTHRO, that inductively extracts over 60...
research
01/16/2023

CRYPTEXT: Database and Interactive Toolkit of Human-Written Text Perturbations in the Wild

User-generated textual contents on the Internet are often noisy, erroneo...
research
01/23/2020

On the human evaluation of audio adversarial examples

Human-machine interaction is increasingly dependent on speech communicat...
research
03/17/2022

ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

Toxic language detection systems often falsely flag text that contains m...
research
10/16/2020

Mischief: A Simple Black-Box Attack Against Transformer Architectures

We introduce Mischief, a simple and lightweight method to produce a clas...
research
09/09/2016

Harassment detection: a benchmark on the #HackHarassment dataset

Online harassment has been a problem to a greater or lesser extent since...
research
01/29/2020

A4 : Evading Learning-based Adblockers

Efforts by online ad publishers to circumvent traditional ad blockers to...
