A Comparison of Word Embeddings for the Biomedical Natural Language Processing

by Yanshan Wang, et al.

Neural word embeddings have been widely used in biomedical Natural Language Processing (NLP) applications because they provide vector representations of words that capture both the semantic properties of words and the linguistic relationships between them. Many biomedical applications train word embeddings on different textual sources and apply them to downstream biomedical tasks. However, there has been little work on comprehensively evaluating the word embeddings trained from these resources. In this study, we provide a comprehensive empirical evaluation of word embeddings trained from four different resources, namely clinical notes, biomedical publications, Wikipedia, and news. We perform the evaluation qualitatively and quantitatively. In the qualitative evaluation, we manually inspect the five most similar medical words to each of a given set of target medical words, and then analyze the word embeddings through visualization. The quantitative evaluation falls into two categories: extrinsic and intrinsic evaluation. Based on the evaluation results, we can draw the following conclusions. First, the word embeddings trained on EHR and PubMed capture the semantics of medical terms better than the GloVe and Google News embeddings, and find more relevant similar medical terms. Second, the medical semantic similarity captured by the word embeddings trained on EHR and PubMed is closer to human experts' judgments than that captured by the GloVe and Google News embeddings. Third, there does not exist a consistent global ranking of word embedding quality for downstream biomedical NLP applications. However, adding word embeddings as extra features will improve results on most downstream tasks. Finally, word embeddings trained from a similar domain corpus do not necessarily have better performance than other word embeddings for any downstream biomedical task.
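
The two kinds of check described in the abstract, listing the nearest neighbours of a target term and comparing embedding similarity with human ratings, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-dimensional toy vectors, the hypothetical rating values, the helper names `most_similar` and `agreement_with_humans`, and the use of cosine similarity and Pearson correlation. The abstract does not state which similarity measure or correlation statistic the study used.

```python
import numpy as np

# Toy 3-dimensional vectors, invented purely for illustration.
# Real embeddings would be loaded from a trained model.
embeddings = {
    "diabetes": np.array([0.9, 0.1, 0.0]),
    "insulin": np.array([0.8, 0.2, 0.1]),
    "glucose": np.array([0.7, 0.3, 0.0]),
    "fracture": np.array([0.1, 0.9, 0.2]),
    "newspaper": np.array([0.0, 0.1, 0.9]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar(target, vectors, k=5):
    """Return the k words closest to `target`, most similar first."""
    scores = [
        (word, cosine(vectors[target], vec))
        for word, vec in vectors.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


def agreement_with_humans(pairs, human_scores, vectors):
    """Pearson correlation between model similarities and human ratings.

    Pearson is an assumption here; a rank correlation could be used instead.
    """
    model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
    return float(np.corrcoef(model_scores, human_scores)[0, 1])


if __name__ == "__main__":
    # Qualitative check: inspect the nearest neighbours of a target term.
    for word, score in most_similar("diabetes", embeddings, k=5):
        print(f"{word}\t{score:.3f}")

    # Intrinsic check: compare against hypothetical human ratings.
    pairs = [("diabetes", "insulin"), ("diabetes", "fracture"), ("diabetes", "newspaper")]
    hypothetical_ratings = [4.0, 1.0, 0.0]
    print(agreement_with_humans(pairs, hypothetical_ratings, embeddings))
```

With real embeddings, the same two functions could be run once per embedding source so that the neighbour lists and correlation values can be compared side by side.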




Evaluating Sparse Interpretable Word Embeddings for Biomedical Domain

Word embeddings have found their way into a wide range of natural langua...

Spanish Biomedical and Clinical Language Embeddings

We computed both Word and Sub-word Embeddings using FastText. For Sub-wo...

Insights into Analogy Completion from the Biomedical Domain

Analogy completion has been a popular task in recent years for evaluatin...

Semi-automatic WordNet Linking using Word Embeddings

Wordnets are rich lexico-semantic resources. Linked wordnets are extensi...

How Powerful Are Randomly Initialized Pointcloud Set Functions?

We study random embeddings produced by untrained neural set functions, a...

Clinical Concept Embeddings Learned from Massive Sources of Medical Data

Word embeddings have emerged as a popular approach to unsupervised learn...

CogniFNN: A Fuzzy Neural Network Framework for Cognitive Word Embedding Evaluation

Word embeddings can reflect the semantic representations, and the embedd...
