CompText: Visualizing, Comparing & Understanding Text Corpus

by Suvi Varshney, et al.

A common goal in Natural Language Processing (NLP) is to visualize a text corpus so that readers can grasp its central idea and key points without reading the entire literature. For a long time, researchers focused on extracting topics from the text and visualizing them according to their relative significance in the corpus. More recently, researchers have built more complex systems that expose not only the topics of a corpus but also the words closely related to each topic, giving users a holistic view. These detailed visualizations spawned research on comparing text corpora based on their visualizations. Topics are often compared to identify the differences between corpora. However, to capture richer semantics from different corpora, researchers have started to compare texts based on the sentiment of the topics related to the text. By comparing the words that carry the most weight, we can get an idea of the important topics in a corpus. Several existing text comparison methods compare topics rather than sentiments, but we believe that focusing on sentiment-carrying words compares two corpora better. Topics without sentiments are just nouns; only sentiment can convey the real feeling of a text. We therefore aim to differentiate corpora with a focus on sentiment, as opposed to comparing all the words appearing in the two corpora. The rationale is that two corpora often do not share many identical words for side-by-side comparison, so comparing the sentiment words gives us an idea of how each corpus appeals to the emotions of the reader. We also argue that the entropy (unexpectedness) and divergence of topics are important, helping us identify key pivot points and the importance of certain topics in the corpus alongside relative sentiment.
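To make the idea of comparing corpora through sentiment-carrying words concrete, here is a minimal sketch. It is not the authors' method: the tiny lexicon `SENTIMENT_WORDS`, the whitespace tokenizer, and the helper names are all hypothetical, and Shannon entropy and Jensen-Shannon divergence are used here only as one plausible reading of "entropy" and "divergence".

```python
import math
from collections import Counter

# Hypothetical toy lexicon (an assumption for illustration only);
# a real analysis would use an established sentiment lexicon.
SENTIMENT_WORDS = {"good", "great", "happy", "bad", "sad", "terrible"}


def sentiment_distribution(text):
    """Relative frequency of each lexicon word among the sentiment words in text."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t in SENTIMENT_WORDS)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def entropy(p):
    """Shannon entropy in bits of a word distribution."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    words = set(p) | set(q)
    # Mixture distribution over the union of both vocabularies.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}

    def kl_to_mixture(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Two made-up example corpora with opposite sentiment vocabularies.
corpus_a = "The food was good, great service, happy staff."
corpus_b = "Bad food, terrible service, sad staff."

dist_a = sentiment_distribution(corpus_a)
dist_b = sentiment_distribution(corpus_b)
print("entropy A:", entropy(dist_a))
print("entropy B:", entropy(dist_b))
print("JS divergence:", js_divergence(dist_a, dist_b))
```

Restricting the comparison to lexicon words means the two corpora need not share topical vocabulary; the divergence reflects only how differently they use sentiment words.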


