Data Sets: Word Embeddings Learned from Tweets and General Data

08/14/2017
by   Quanzhi Li, et al.
0

A word embedding is a low-dimensional, dense and real- valued vector representation of a word. Word embeddings have been used in many NLP tasks. They are usually gener- ated from a large text corpus. The embedding of a word cap- tures both its syntactic and semantic aspects. Tweets are short, noisy and have unique lexical and semantic features that are different from other types of text. Therefore, it is necessary to have word embeddings learned specifically from tweets. In this paper, we present ten word embedding data sets. In addition to the data sets learned from just tweet data, we also built embedding sets from the general data and the combination of tweets with the general data. The general data consist of news articles, Wikipedia data and other web data. These ten embedding models were learned from about 400 million tweets and 7 billion words from the general text. In this paper, we also present two experiments demonstrating how to use the data sets in some NLP tasks, such as tweet sentiment analysis and tweet topic classification tasks.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/06/2020

Quality of Word Embeddings on Sentiment Analysis Tasks

Word embeddings or distributed representations of words are being used i...
research
11/29/2016

Identity-sensitive Word Embedding through Heterogeneous Networks

Most existing word embedding approaches do not distinguish the same word...
research
07/02/2016

Representation learning for very short texts using weighted word embedding aggregation

Short text messages such as tweets are very noisy and sparse in their us...
research
07/25/2020

Effect of Text Processing Steps on Twitter Sentiment Classification using Word Embedding

Processing of raw text is the crucial first step in text classification ...
research
08/25/2020

A simple method for domain adaptation of sentence embeddings

Pre-trained sentence embeddings have been shown to be very useful for a ...
research
02/01/2019

A Simple Regularization-based Algorithm for Learning Cross-Domain Word Embeddings

Learning word embeddings has received a significant amount of attention ...
research
02/02/2019

Word Embeddings for Sentiment Analysis: A Comprehensive Empirical Survey

This work investigates the role of factors like training method, trainin...

Please sign up or login with your details

Forgot password? Click here to reset