BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance
Pretraining deep contextualized representations using an unsupervised language modeling objective has led to large performance gains for a variety of NLP tasks. Notwithstanding their enormous success, recent work by Schick and Schütze (2019) suggests that these architectures struggle to understand many rare words. For context-independent word embeddings, this problem can be addressed by explicitly relearning representations for infrequent words. In this work, we show that the very same idea can also be applied to contextualized models and clearly improves their downstream task performance. As previous approaches for relearning word embeddings are commonly based on fairly simple bag-of-words models, they are not a suitable counterpart for complex language models based on deep neural networks. To overcome this problem, we introduce BERTRAM, a powerful architecture that is based on a pretrained BERT language model and capable of inferring high-quality representations for rare words through a deep interconnection of their surface form and the contexts in which they occur. Both on a rare word probing task and on three downstream task datasets, BERTRAM considerably improves representations for rare and medium-frequency words compared to both a standalone BERT model and previous work.
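The abstract only describes the architecture at a high level. As a rough illustration of the form-plus-context idea, the sketch below combines a surface-form embedding (here a hashed character n-gram bag) with per-context vectors obtained from a frozen pretrained BERT, and fuses them with a simple attention layer. The class name, the n-gram hashing scheme, the pooling, and the fusion layer are all illustrative assumptions; this is not the actual BERTRAM architecture or its training procedure.

```python
# Minimal sketch of inferring an embedding for a rare word from its surface
# form and a few contexts. All design choices below are assumptions made for
# illustration; they do not reproduce BERTRAM itself.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class RareWordEmbedder(nn.Module):
    """Infer a static embedding for a rare word from its surface form and a
    handful of contexts, loosely following the form + context idea."""

    def __init__(self, bert_name: str = "bert-base-uncased",
                 ngram_buckets: int = 100_000, dim: int = 768):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        for p in self.bert.parameters():      # keep BERT frozen in this sketch
            p.requires_grad = False
        # Hashed character n-gram table for the surface form (assumption).
        self.ngram_emb = nn.Embedding(ngram_buckets, dim)
        # Attention scores for combining per-context vectors (assumption).
        self.context_scorer = nn.Linear(dim, 1)
        self.fuse = nn.Linear(2 * dim, dim)

    def _form_embedding(self, word: str, n: int = 3) -> torch.Tensor:
        # Bag of hashed character n-grams over the padded surface form.
        padded = f"<{word}>"
        ngrams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
        ids = torch.tensor([hash(g) % self.ngram_emb.num_embeddings
                            for g in ngrams])
        return self.ngram_emb(ids).mean(dim=0)

    def _context_embedding(self, context: str) -> torch.Tensor:
        # Average BERT's hidden states as a crude per-context summary
        # (a real model would locate and pool the rare word's own positions).
        enc = self.tokenizer(context, return_tensors="pt", truncation=True)
        hidden = self.bert(**enc).last_hidden_state[0]   # (seq_len, dim)
        return hidden.mean(dim=0)

    def forward(self, word: str, contexts: list[str]) -> torch.Tensor:
        form = self._form_embedding(word)
        ctx = torch.stack([self._context_embedding(c) for c in contexts])
        weights = torch.softmax(self.context_scorer(ctx).squeeze(-1), dim=0)
        ctx_pooled = (weights.unsqueeze(-1) * ctx).sum(dim=0)
        return self.fuse(torch.cat([form, ctx_pooled], dim=-1))


# Usage: infer an embedding for a rare word from two short contexts.
embedder = RareWordEmbedder()
vec = embedder("quokka", ["The quokka is a small marsupial.",
                          "Tourists photograph the smiling quokka."])
print(vec.shape)  # torch.Size([768])
```

In this toy version the contexts are pooled by a learned attention over whole-sentence averages; the paper's point is that a deep, BERT-based interconnection of form and context outperforms such shallow bag-of-words-style combinations.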