On Extending NLP Techniques from the Categorical to the Latent Space: KL Divergence, Zipf's Law, and Similarity Search

12/02/2020
by Adam Hare, et al.

Despite the recent successes of deep learning in natural language processing (NLP), there remains widespread usage of and demand for techniques that do not rely on machine learning. The advantage of these techniques is their interpretability and low cost compared with machine learning models, which are frequently opaque and expensive. Although they may not be as performant in all cases, they are often sufficient for common and relatively simple problems. In this paper, we aim to modernize these older methods while retaining their advantages by extending approaches from categorical or bag-of-words representations to word embedding representations in the latent space. First, we show that entropy and Kullback-Leibler divergence can be efficiently estimated using word embeddings, and we use this estimation to compare text across several categories. Next, we recast the heavy-tailed distribution known as Zipf's law, which is frequently observed in the categorical space, to the latent space. Finally, we look to improve the Jaccard similarity measure for sentence suggestion by introducing a new method of identifying similar sentences based on the set cover problem. We compare the performance of this algorithm against several baselines, including Word Mover's Distance and the Levenshtein distance.
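For orientation, the categorical (bag-of-words) quantities that the abstract takes as its starting point can be sketched in a few lines of Python. This is only an illustration of the classical baselines, not the embedding-based estimators or the set-cover method proposed in the paper. The function names, the toy sentences, and the add-alpha smoothing are all assumptions introduced here.

```python
import math
from collections import Counter


def word_distribution(tokens, vocab, alpha=1.0):
    """Smoothed unigram distribution over a shared vocabulary.

    Add-alpha smoothing is an assumption made here so that KL divergence
    stays finite when a word is missing from one of the texts.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def entropy(p):
    """Shannon entropy of a distribution, in bits."""
    return -sum(pw * math.log2(pw) for pw in p.values() if pw > 0)


def kl_divergence(p, q):
    """KL(p || q) in bits; p and q must be defined over the same vocabulary."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)


def jaccard(a, b):
    """Jaccard similarity between the token sets of two sentences."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical toy sentences, whitespace-tokenized.
doc_a = "the cat sat on the mat".split()
doc_b = "the dog sat on the log".split()
vocab = sorted(set(doc_a) | set(doc_b))

p = word_distribution(doc_a, vocab)
q = word_distribution(doc_b, vocab)

kl_pq = kl_divergence(p, q)   # asymmetric: generally differs from KL(q || p)
jac = jaccard(doc_a, doc_b)   # shared tokens over all distinct tokens
```

Both measures operate on exact token identity, so "cat" and "dog" count as entirely unrelated. That limitation is what motivates moving from the categorical space to the latent space of word embeddings.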
