A Dictionary-based Approach to Racism Detection in Dutch Social Media

by Stéphan Tulkens, et al.

We present a dictionary-based approach to racism detection in Dutch social media comments. The comments were retrieved from two public Belgian social media sites likely to attract racist reactions, and were labeled as racist or non-racist by multiple annotators. For our approach, we created three discourse dictionaries. The first was built by retrieving possibly racist and more neutral terms from the training data, and then augmenting these with more general words to remove some bias. The second was created through automatic expansion using a word2vec model trained on a large corpus of general Dutch text. The third was created by manually filtering out incorrect expansions. We trained multiple Support Vector Machines, using the distribution of words over the different categories in the dictionaries as features. The best-performing model used the manually cleaned dictionary and obtained an F-score of 0.46 for the racist class on a test set of unseen Dutch comments retrieved from the same sites as the training set. The automated expansion of the dictionary only slightly boosted the model's performance, and this increase was not statistically significant. The coverage of the expanded dictionaries did increase, which indicates that the automatically added words did occur in the corpus but did not meaningfully affect performance. The dictionaries, code, and the procedure for requesting the corpus are available at: https://github.com/clips/hades
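
To make the feature representation concrete, here is a minimal sketch of how a comment might be turned into a distribution over dictionary categories. It is not the authors' implementation: the dictionary contents, the category names, the function name `category_distribution`, the regex tokenizer, and the normalization by token count are all illustrative assumptions. The resulting vector is the kind of input that could then be passed to an SVM.

```python
import re
from collections import Counter

# Hypothetical toy dictionary (category -> terms); the real dictionaries
# are distributed with the authors' repository and are not reproduced here.
TOY_DICTIONARY = {
    "evaluation": {"goed", "slecht"},
    "place": {"land", "stad"},
}


def category_distribution(comment, dictionary):
    """Return the share of tokens falling in each dictionary category.

    Categories are ordered alphabetically so the vector layout is stable.
    Normalizing by the number of tokens is an assumption of this sketch.
    """
    tokens = re.findall(r"\w+", comment.lower())
    counts = Counter()
    for token in tokens:
        for category, terms in dictionary.items():
            if token in terms:
                counts[category] += 1
    total = len(tokens) or 1  # avoid division by zero on empty comments
    return [counts[category] / total for category in sorted(dictionary)]


features = category_distribution("Goed land, slecht land", TOY_DICTIONARY)
```

Each comment yields one fixed-length vector with one entry per category, so enlarging a dictionary changes which tokens are counted but not the dimensionality of the features.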




