Improving Counterfactual Generation for Fair Hate Speech Detection

08/03/2021
by Aida Mostafazadeh Davani, et al.

Bias mitigation approaches reduce a model's dependence on sensitive features of data, such as social group tokens (SGTs), with the goal of equalizing predictions across those features. In hate speech detection, however, equalizing model predictions may overlook important differences among targeted social groups, since hate speech often contains stereotypical language specific to each SGT. Here, to account for the language specific to each SGT, we rely on counterfactual fairness and equalize predictions among counterfactuals generated by substituting one SGT for another. Our method compares sentence likelihoods (computed with pre-trained language models) among counterfactuals, so that SGTs are treated as equivalent only in contexts where they are interchangeable. By applying logit pairing to equalize outcomes on this restricted set of counterfactuals for each instance, we improve fairness metrics while preserving model performance on hate speech detection.
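The abstract describes two steps: filtering counterfactuals by sentence likelihood under a pre-trained language model, and pairing logits over the retained counterfactuals. The following is a minimal sketch of those two pieces, not the authors' released code; it assumes the HuggingFace transformers library with GPT-2 as the scoring LM, and the SGT list, likelihood threshold, and substitution scheme are illustrative assumptions rather than the paper's published choices.

```python
# Sketch (hypothetical): likelihood-filtered counterfactuals + logit pairing.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

lm_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

SGTS = ["women", "muslims", "immigrants"]  # toy SGT vocabulary (assumption)

@torch.no_grad()
def sentence_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the pre-trained LM."""
    ids = lm_tokenizer(text, return_tensors="pt").input_ids
    # The causal-LM loss is mean token-level cross-entropy; negate it to get
    # an average log-likelihood.
    return -lm(ids, labels=ids).loss.item()

def interchangeable_counterfactuals(text: str, sgt: str, threshold: float = 1.0):
    """Keep only counterfactuals whose LM likelihood stays close to the
    original sentence's, i.e. contexts where SGTs are interchangeable.
    Naive string replacement and `threshold` are illustrative simplifications."""
    base_ll = sentence_log_likelihood(text)
    kept = []
    for other in SGTS:
        if other == sgt:
            continue
        counterfactual = text.replace(sgt, other)
        if abs(sentence_log_likelihood(counterfactual) - base_ll) < threshold:
            kept.append(counterfactual)
    return kept

def logit_pairing_loss(logits_orig: torch.Tensor, logits_cfs: list) -> torch.Tensor:
    """Mean squared difference between the original instance's classifier
    logits and each retained counterfactual's logits."""
    if not logits_cfs:
        return torch.tensor(0.0)
    return torch.stack([(logits_orig - l).pow(2).mean() for l in logits_cfs]).mean()
```

In training, the pairing term would be added to the ordinary classification loss, e.g. `loss = ce_loss + lambda_pair * logit_pairing_loss(logits_orig, logits_cfs)`, where `lambda_pair` is a hypothetical weight tuned on validation data; the likelihood filter keeps the penalty from forcing equal predictions in contexts where swapping SGTs changes the meaning of the sentence.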



research
10/24/2020

Fair Hate Speech Detection through Evaluation of Social Group Counterfactuals

Approaches for mitigating bias in supervised models are designed to redu...
research
11/08/2019

Reducing Sentiment Bias in Language Models via Counterfactual Evaluation

Recent improvements in large-scale language models have driven progress ...
research
02/16/2023

Counterfactual Fair Opportunity: Measuring Decision Model Fairness with Counterfactual Reasoning

The increasing application of Artificial Intelligence and Machine Learni...
research
03/17/2022

Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists

Natural Language Processing (NLP) models risk overfitting to specific te...
research
02/08/2022

Counterfactual Multi-Token Fairness in Text Classification

The counterfactual token generation has been limited to perturbing only ...
research
06/28/2022

Flexible text generation for counterfactual fairness probing

A common approach for testing fairness issues in text-based classifiers ...
research
10/19/2022

How Hate Speech Varies by Target Identity: A Computational Analysis

This paper investigates how hate speech varies in systematic ways accord...
