Document Embedding with Paragraph Vectors

07/29/2015
by Andrew M. Dai, et al.

Paragraph Vectors were recently proposed as an unsupervised method for learning distributed representations of pieces of text. The authors of that work showed that the method can learn an embedding of movie review texts that can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide a more thorough comparison of Paragraph Vectors with other document modelling algorithms such as Latent Dirichlet Allocation, and evaluate the performance of the method as we vary the dimensionality of the learned representation. We benchmark the models on two document similarity data sets, one from Wikipedia and one from arXiv. We observe that the Paragraph Vector method performs significantly better than other methods, and we propose a simple improvement to enhance embedding quality. Somewhat surprisingly, we also show that, much like word embeddings, vector operations on Paragraph Vectors can produce useful semantic results.
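
To make the last claim concrete, the following is a minimal sketch of what "vector operations" on document embeddings can look like: subtract one word vector from a document vector, add another, and look up the nearest remaining document by cosine similarity. It is not the authors' implementation. The three-dimensional vectors, the labels (`us_pop_singer`, `jp_pop_singer`, `physics_paper`, `american`, `japanese`), and the helper names are all hypothetical and hand-picked for illustration; real Paragraph Vectors are learned from data and have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]


def nearest(query, candidates, exclude=()):
    """Return the label in `candidates` whose vector is most similar to `query`."""
    best_label, best_score = None, -2.0  # cosine is always >= -1
    for label, vec in candidates.items():
        if label in exclude:
            continue
        score = cosine(query, vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy embeddings (assumption: axes loosely mean
# "pop music", "American", "Japanese"); not learned from any corpus.
DOCS = {
    "us_pop_singer": [1.0, 1.0, 0.0],
    "jp_pop_singer": [1.0, 0.0, 1.0],
    "physics_paper": [0.0, 0.1, 0.1],
}
WORDS = {
    "american": [0.0, 1.0, 0.0],
    "japanese": [0.0, 0.0, 1.0],
}

# document - word + word, then nearest-neighbour lookup among the other documents
query = add(sub(DOCS["us_pop_singer"], WORDS["american"]), WORDS["japanese"])
result = nearest(query, DOCS, exclude={"us_pop_singer"})
print(result)
```

The same `cosine` function is also the usual building block for the document similarity tasks mentioned above: a model is scored by whether documents known to be related end up closer together than unrelated ones.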


research
08/02/2015

Class Vectors: Embedding representation of Document Classes

Distributed representations of words and paragraphs as semantic embeddin...
research
06/01/2020

Hybrid Improved Document-level Embedding (HIDE)

In recent times, word embeddings are taking a significant role in sentim...
research
07/08/2017

Efficient Vector Representation for Documents through Corruption

We present an efficient document representation learning framework, Docu...
research
05/31/2017

Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology Based Representations

We investigate the pertinence of methods from algebraic topology for tex...
research
04/05/2016

Feature extraction using Latent Dirichlet Allocation and Neural Networks: A case study on movie synopses

Feature extraction has gained increasing attention in the field of machi...
research
12/27/2015

Learning Document Embeddings by Predicting N-grams for Sentiment Classification of Long Movie Reviews

Despite the loss of semantic information, bag-of-ngram based methods sti...
research
12/11/2015

Words are not Equal: Graded Weighting Model for building Composite Document Vectors

Despite the success of distributional semantics, composing phrases from ...
