Exploiting user-frequency information for mining regionalisms from Social Media texts

07/10/2019
by Juan Manuel Pérez, et al.

The task of detecting regionalisms (expressions or words used in certain regions) has traditionally relied on questionnaires and surveys, and has depended heavily on the expertise and intuition of the surveyor. The rise of Social Media and its microblogging services has produced an unprecedented wealth of content, mainly informal text generated by users, opening new opportunities for linguists to extend their studies of language variation. Previous work on automatic detection of regionalisms depended mostly on word frequencies. In this work, we present a novel metric based on Information Theory that incorporates user frequency. We tested this metric on a corpus of Argentinian Spanish tweets in two ways: via manual annotation of the relevance of the retrieved terms, and as a feature selection method for geolocation of users. In both cases, our metric outperformed other techniques based solely on word frequency, suggesting that measuring the number of users who produce a word is informative. This tool has helped lexicographers discover several unregistered words of Argentinian Spanish, as well as different meanings assigned to registered words.
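The abstract does not state the metric's formula, so the following is only a rough sketch of the general idea, not the authors' metric. It assumes a simple entropy-based concentration score (maximum entropy minus observed entropy over regions), computed once from word occurrences and once from the number of distinct users. The function names, the input layout, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def region_entropy(counts):
    """Shannon entropy (in bits) of a {region: count} distribution."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def regionalism_scores(tweets, regions):
    """Return {word: (word_based_score, user_based_score)}.

    tweets: iterable of (user_id, region, tokens) tuples (assumed layout).
    Each score is log2(number of regions) minus the entropy over regions,
    so higher values mean the word is more concentrated in few regions.
    This is an illustrative stand-in, not the metric from the paper.
    """
    occurrences = defaultdict(lambda: defaultdict(int))
    users = defaultdict(lambda: defaultdict(set))
    for user, region, tokens in tweets:
        for token in tokens:
            occurrences[token][region] += 1
            users[token][region].add(user)

    max_entropy = math.log2(len(regions))
    scores = {}
    for word, per_region in occurrences.items():
        user_counts = {r: len(u) for r, u in users[word].items()}
        scores[word] = (
            max_entropy - region_entropy(per_region),
            max_entropy - region_entropy(user_counts),
        )
    return scores


# Toy data: "promo" is repeated ten times by one account in region A,
# while "che" is used once each by three different users in region A.
toy_tweets = [
    ("u1", "A", ["promo"] * 10),
    ("u2", "B", ["promo"]),
    ("u3", "B", ["promo"]),
    ("u4", "A", ["che"]),
    ("u5", "A", ["che"]),
    ("u6", "A", ["che"]),
]
toy_scores = regionalism_scores(toy_tweets, regions=["A", "B"])
```

In this toy data, the word-based score for "promo" is inflated by a single prolific account, whereas the user-based score is lower because most of its distinct users are in region B. This is the kind of effect that counting users rather than occurrences is meant to capture.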
