Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts

by Ryan J. Gallagher, et al.

A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content. However, collapsing the texts' rich stories into a single number is often conceptually perilous, and it is difficult to confidently interpret interesting or unexpected textual patterns without looming concerns about data artifacts or measurement validity. To better capture fine-grained differences between texts, we introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts for any measure that can be formulated as a weighted average. We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback-Leibler and Jensen-Shannon divergences. Through several case studies, we demonstrate how generalized word shift graphs can be flexibly applied across domains for diagnostic investigation, hypothesis generation, and substantive interpretation. By providing a detailed lens into textual shifts between corpora, generalized word shift graphs help computational social scientists, digital humanists, and other text analysis practitioners fashion more robust scientific narratives.
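The weighted-average idea in the abstract can be made concrete with a small sketch. It assumes the simplest case, a dictionary score weighted by relative word frequency, and is not the authors' implementation. The function names, the toy score dictionary, and the two example texts are all hypothetical. Each word's contribution is its score multiplied by the change in its relative frequency, so the contributions sum to the difference between the two texts' average scores.

```python
from collections import Counter


def relative_freqs(tokens):
    """Relative frequency of each word in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def weighted_average(tokens, scores):
    """Frequency-weighted average score over the words that have a score."""
    scored = [w for w in tokens if w in scores]
    freqs = relative_freqs(scored)
    return sum(p * scores[w] for w, p in freqs.items())


def word_contributions(tokens_1, tokens_2, scores):
    """Per-word contribution to (average of text 2) - (average of text 1).

    Assumption: scores are fixed per word, so a word's contribution is
    score * (change in relative frequency). Words without a score are dropped.
    """
    p1 = relative_freqs([w for w in tokens_1 if w in scores])
    p2 = relative_freqs([w for w in tokens_2 if w in scores])
    vocab = set(p1) | set(p2)
    contrib = {w: (p2.get(w, 0.0) - p1.get(w, 0.0)) * scores[w] for w in vocab}
    # Largest absolute contributions first, as a word shift graph would rank them.
    return dict(sorted(contrib.items(), key=lambda kv: abs(kv[1]), reverse=True))


# Hypothetical toy data, for illustration only.
scores = {"happy": 8.0, "sad": 2.0, "day": 6.0}
text_1 = "sad sad day happy".split()
text_2 = "happy happy day sad".split()

contrib = word_contributions(text_1, text_2, scores)
total_shift = weighted_average(text_2, scores) - weighted_average(text_1, scores)

for word, value in contrib.items():
    print(f"{word:>6}: {value:+.2f}")
print(f"sum of contributions: {sum(contrib.values()):+.2f}")
print(f"difference in averages: {total_shift:+.2f}")
```

Ranking words by the absolute size of their contribution is what lets a bar chart of these values serve as an explanation of the overall difference, rather than reporting the single number alone.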




Benchmarking sentiment analysis methods for large-scale texts: A case for using continuum-scored words and word shift graphs

The emergence and global adoption of social media has rendered possible ...

Generalized Entropies and the Similarity of Texts

We show how generalized Gibbs-Shannon entropies can provide new insights...

Variation of word frequencies in Russian literary texts

We study the variation of word frequencies in Russian literary texts. Ou...

CompText: Visualizing, Comparing & Understanding Text Corpus

A common practice in Natural Language Processing (NLP) is to visualize t...

The word entropy of natural languages

The average uncertainty associated with words is an information-theoreti...

Quantifying the Dissimilarity of Texts

Quantifying the dissimilarity of two texts is an important aspect of a n...

Gov2Vec: Learning Distributed Representations of Institutions and Their Legal Text

We compare policy differences across institutions by embedding represent...
