Gauravarora@HASOC-Dravidian-CodeMix-FIRE2020: Pre-training ULMFiT on Synthetically Generated Code-Mixed Data for Hate Speech Detection

10/05/2020
by Gaurav Arora, et al.

This paper describes the system submitted to Dravidian-Codemix-HASOC2020: Hate Speech and Offensive Content Identification in Dravidian languages (Tamil-English and Malayalam-English). The task aims to identify offensive language in a code-mixed dataset of comments and posts in Dravidian languages collected from social media. We participated in both Sub-task A, which aims to identify offensive content in mixed-script text (a mixture of native and Roman script), and Sub-task B, which aims to identify offensive content in Roman script, for Dravidian languages. To address these tasks, we propose pre-training ULMFiT on synthetically generated code-mixed data, produced by modelling code-mixed data generation as a Markov process using Markov chains. Our model achieved a weighted F1-score of 0.88 for code-mixed Tamil-English in Sub-task B, ranking 2nd on the leaderboard. Additionally, it achieved a weighted F1-score of 0.91 (4th rank) for mixed-script Malayalam-English in Sub-task A and 0.74 (5th rank) for code-mixed Malayalam-English in Sub-task B.
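The abstract does not spell out how the Markov chain is built, so the following is only a minimal sketch of the general idea: a first-order, word-level Markov chain is estimated from a handful of sentences and then sampled to produce new synthetic sentences. The function names (`build_chain`, `generate`), the boundary tokens, and the toy corpus are all assumptions made for illustration; the sentences are invented placeholders, not data from the shared task, and the authors' actual generation procedure may differ.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary tokens


def build_chain(sentences):
    """Map each token to the list of tokens observed right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for current, following in zip(tokens, tokens[1:]):
            # Repeated entries keep the empirical transition frequencies.
            chain[current].append(following)
    return chain


def generate(chain, max_len=20, rng=None):
    """Sample one sentence by walking the chain from START until END."""
    rng = rng or random.Random()
    state, output = START, []
    while len(output) < max_len:
        state = rng.choice(chain[state])
        if state == END:
            break
        output.append(state)
    return " ".join(output)


# Invented romanized placeholder sentences (assumption, not task data).
toy_corpus = [
    "intha movie super ah irukku",
    "intha song romba nalla irukku",
    "movie trailer super bro",
]

chain = build_chain(toy_corpus)
rng = random.Random(0)  # fixed seed so runs are repeatable
synthetic = [generate(chain, rng=rng) for _ in range(5)]
for sentence in synthetic:
    print(sentence)
```

Because each next word depends only on the current word, the sampled sentences recombine fragments of the seed sentences. A corpus generated this way could then serve as unlabeled text for language-model pre-training, which is the role the abstract assigns to the synthetic data.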


research
10/06/2021

PSG@HASOC-Dravidian CodeMixFIRE2021: Pretrained Transformers for Offensive Language Identification in Tanglish

This paper describes the system submitted to Dravidian-Codemix-HASOC2021...
research
11/01/2020

WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification in Code-switched YouTube Comments

This paper describes the WLV-RIT entry to the Hate Speech and Offensive ...
research
07/29/2021

IIITG-ADBU@HASOC-Dravidian-CodeMix-FIRE2020: Offensive Content Detection in Code-Mixed Dravidian Text

This paper presents the results obtained by our SVM and XLM-RoBERTa base...
research
08/27/2021

Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling

Social media has effectively become the prime hub of communication and d...
research
08/10/2021

Hope Speech detection in under-resourced Kannada language

Numerous methods have been developed to monitor the spread of negativity...
research
10/17/2020

CUSATNLP@HASOC-Dravidian-CodeMix-FIRE2020:Identifying Offensive Language from ManglishTweets

With the popularity of social media, communications through blogs, Faceb...
research
04/08/2022

RubCSG at SemEval-2022 Task 5: Ensemble learning for identifying misogynous MEMEs

This work presents an ensemble system based on various uni-modal and bi-...
