Mixed Membership Word Embeddings for Computational Social Science

by   James Foulds, et al.

Word embeddings improve the performance of NLP systems by revealing the hidden structural relationships between words. These models have recently risen in popularity due to the performance of scalable algorithms trained in the big data setting. Despite their success, word embeddings have seen very little use in computational social science NLP tasks, presumably due to their reliance on big data, and to a lack of interpretability. I propose a probabilistic model-based word embedding method which can recover interpretable embeddings, without big data. The key insight is to leverage the notion of mixed membership modeling, in which global representations are shared, but individual entities (i.e. dictionary words) are free to use these representations to uniquely differing degrees. Leveraging connections to topic models, I show how to train these models in high dimensions using a combination of state-of-the-art techniques for word embeddings and topic modeling. Experimental results show an improvement in predictive performance of up to 63 small datasets. The models are interpretable, as embeddings of topics are used to encode embeddings for words (and hence, documents) in a model-based way. I illustrate this with two computational social science case studies, on NIPS articles and State of the Union addresses.


page 1

page 2

page 3

page 4


Interpretable Word Embeddings via Informative Priors

Word embeddings have demonstrated strong performance on NLP tasks. Howev...

WMDecompose: A Framework for Leveraging the Interpretable Properties of Word Mover's Distance in Sociocultural Analysis

Despite the increasing popularity of NLP in the humanities and social sc...

A Neural Generative Model for Joint Learning Topics and Topic-Specific Word Embeddings

We propose a novel generative model to explore both local and global con...

LTSG: Latent Topical Skip-Gram for Mutually Learning Topic Model and Vector Representations

Topic models have been widely used in discovering latent topics which ar...

IRB-NLP at SemEval-2022 Task 1: Exploring the Relationship Between Words and Their Semantic Representations

What is the relation between a word and its description, or a word and i...

Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences

Pre-trained language models have led to a new state-of-the-art in many N...

Equation Embeddings

We present an unsupervised approach for discovering semantic representat...

Please sign up or login with your details

Forgot password? Click here to reset