Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF

by H. J. Meijer, et al.

Over the last few years, word embeddings derived from neural networks have become popular in the natural language processing literature. Existing studies have mostly focused on the quality and application of word embeddings trained on publicly available corpora such as Wikipedia or other news and social media sources. These studies are limited to generic text and therefore miss technical and scientific nuances such as domain-specific vocabulary, abbreviations, and scientific formulas, which are common in academic contexts. This research focuses on the performance of word embeddings applied to a large-scale academic corpus. More specifically, we compare the quality and efficiency of trained word embeddings to TFIDF representations in modeling the content of scientific articles. We use a word2vec skip-gram model trained on the titles and abstracts of about 70 million scientific articles. Furthermore, we have developed a benchmark to evaluate content models in a scientific context. The benchmark is based on a categorization task that matches articles to journals for about 1.3 million articles published in 2017. Our results show that content models based on word embeddings perform better on titles (short text), while TFIDF performs better on abstracts (longer text). However, the slight improvement of TFIDF on longer text comes at the expense of 3.7 times more memory and up to 184 times longer computation time, which may make it inefficient for online applications. In addition, we have created a 2-dimensional visualization of the journals modeled via embeddings to qualitatively inspect the embedding model. This graph offers useful insights and can be used to find competing journals or to identify gaps where new journals could be proposed.
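To make the comparison concrete, the following is a minimal sketch of an article-to-journal matching task using both representations. It is not the authors' implementation: the abstract does not say how word vectors are combined into a document vector, so averaging is an assumption here, as are the cosine-similarity ranking, the smoothed IDF formula, the two toy journals, and the hand-written 2-dimensional vectors that stand in for a trained word2vec model.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenized document (assumed weighting)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: (c / len(doc)) * idf[t] for t, c in Counter(doc).items()}
            for doc in docs]

def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return {}
    # Stored as {dimension index: value} so one cosine function serves both models.
    return {i: sum(v[i] for v in known) / len(known) for i in range(dim)}

def cosine(a, b):
    """Cosine similarity between two dict-based vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def best_journal(title_vec, journal_vecs):
    """Return the journal whose vector is most similar to the title vector."""
    return max(journal_vecs, key=lambda name: cosine(title_vec, journal_vecs[name]))

# Hypothetical toy data: each journal is summarized by a few tokens.
JOURNALS = {
    "genomics": ["gene", "expression", "sequencing"],
    "optics": ["laser", "photon", "lens"],
}
# Hypothetical 2-d vectors standing in for trained skip-gram embeddings.
TOY_VECTORS = {
    "gene": [1.0, 0.1], "genome": [0.9, 0.2], "sequencing": [0.8, 0.1],
    "expression": [0.9, 0.0], "laser": [0.1, 1.0], "photon": [0.0, 0.9],
    "lens": [0.2, 0.8],
}
TITLE = ["genome", "sequencing", "gene"]

names = list(JOURNALS)
tfidf = tfidf_vectors([JOURNALS[n] for n in names] + [TITLE])
tfidf_pick = best_journal(tfidf[-1], dict(zip(names, tfidf[:-1])))
embed_pick = best_journal(
    mean_embedding(TITLE, TOY_VECTORS, 2),
    {n: mean_embedding(JOURNALS[n], TOY_VECTORS, 2) for n in names},
)
print(tfidf_pick, embed_pick)
```

In this sketch the TFIDF vectors have one entry per distinct term, whereas the embedding vectors have a fixed, small number of dimensions. That difference is consistent with the memory and computation gap reported above, although the toy example does not measure it.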


Domain-Specific Word Embeddings with Structure Prediction

Complementary to finding good general word embeddings, an important ques...

Word Embeddings for the Armenian Language: Intrinsic and Extrinsic Evaluation

In this work, we intrinsically and extrinsically evaluate and compare ex...

Beyond Word Embeddings: Learning Entity and Concept Representations from Large Scale Knowledge Bases

Text representation using neural word embeddings has proven efficacy in ...

Characterizing Diseases from Unstructured Text: A Vocabulary Driven Word2vec Approach

Traditional disease surveillance can be augmented with a wide variety of...

Word Embeddings for the Construction Domain

We introduce word vectors for the construction domain. Our vectors were ...

Clinical Concept Embeddings Learned from Massive Sources of Medical Data

Word embeddings have emerged as a popular approach to unsupervised learn...

Political Depolarization of News Articles Using Attribute-aware Word Embeddings

Political polarization in the US is on the rise. This polarization negat...
