Learning to Distill: The Essence Vector Modeling Framework

by Kuan-Yu Chen, et al.

In natural language processing, representation learning has become an active research area owing to its excellent performance in many applications. Learning representations of words was the pioneering work in this line of research; however, paragraph (or sentence and document) embeddings are more suitable for tasks such as sentiment classification and document summarization. Nevertheless, to the best of our knowledge, relatively little work has focused on developing unsupervised paragraph embedding methods. Classic paragraph embedding methods infer the representation of a given paragraph from all of the words occurring in it, so frequently occurring stop and function words can mislead the learning process and yield a vague paragraph representation. Motivated by these observations, the major contributions of this paper are twofold. First, we propose a novel unsupervised paragraph embedding method, named the essence vector (EV) model, which aims not only to distill the most representative information from a paragraph but also to exclude general background information, producing a more informative low-dimensional vector representation of the paragraph. Second, in view of the increasing importance of spoken content processing, we propose an extension of the EV model, named the denoising essence vector (D-EV) model. The D-EV model inherits the advantages of the EV model while inferring a representation of a spoken paragraph that is more robust to imperfect speech recognition.
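The abstract does not specify the EV model's formulation, but the core intuition of "distilling representative information while excluding general background information" can be illustrated with a simple stand-in technique. The toy sketch below (not the authors' method; the embeddings, corpus, and the background-removal step are all illustrative assumptions) averages word vectors for a paragraph and then projects out the corpus-wide mean direction, so that the dominant shared component contributed by frequent stop/function words is removed:

```python
# Illustrative sketch only: the EV model itself is not specified in the
# abstract. This toy example shows one simple way to "subtract background
# information": a paragraph vector is the mean of its word vectors with the
# corpus-wide mean (background) direction projected out.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "a", "movie", "plot", "great", "boring"]
emb = {w: rng.normal(size=8) for w in vocab}  # toy 8-dim word embeddings

corpus = [
    ["the", "movie", "plot", "great"],
    ["a", "boring", "movie"],
]

# Background vector: mean word embedding over the whole corpus.
all_vecs = np.array([emb[w] for para in corpus for w in para])
background = all_vecs.mean(axis=0)

def essence_like_vector(paragraph):
    """Mean word vector with the background direction projected out."""
    v = np.mean([emb[w] for w in paragraph], axis=0)
    b = background / np.linalg.norm(background)
    return v - (v @ b) * b  # remove the component along the background

vec = essence_like_vector(corpus[0])
print(vec.shape)  # (8,)
```

After the projection, the resulting vector is orthogonal to the background direction, so two paragraphs are compared only on what distinguishes them from the corpus-wide average rather than on the shared, uninformative component.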



