Emergent Properties of Finetuned Language Representation Models

10/23/2019
by Alexandre Matton, et al.

Large, self-supervised transformer-based language representation models have recently received significant attention, and have produced state-of-the-art results across a variety of tasks simply by scaling up pre-training on larger and larger corpora. Such models usually produce high-dimensional vectors, on top of which additional task-specific layers and architectural modifications are added to adapt them to specific downstream tasks. Though there is ample evidence that such models work well, we aim to understand what happens when they do. We analyze the redundancy and location of information contained in the output vectors of one such language representation model, BERT. We show empirical evidence that the [CLS] embedding in BERT contains highly redundant information and can be compressed with minimal loss of accuracy, especially for finetuned models, dovetailing with open questions in the field about the role of over-parameterization in learning. We also show that certain individual output dimensions alone give very competitive results compared to using all dimensions of the output vectors.
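
As a rough illustration of the kind of analysis described above, the sketch below extracts the [CLS] embedding from a pretrained BERT model, compresses it, and trains a linear probe on the result, including a probe on a single dimension in isolation. The compression method (PCA), the logistic-regression probe, the toy labelled examples, and the dimension index are illustrative assumptions, not the authors' exact experimental setup.

```python
# Minimal sketch, assuming PCA compression and a logistic-regression probe;
# not the paper's exact setup.
import numpy as np
import torch
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def cls_embeddings(texts, batch_size=16):
    """Return the last-layer [CLS] vector (768-d for bert-base) for each text."""
    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, return_tensors="pt")
            out = model(**batch)
            chunks.append(out.last_hidden_state[:, 0, :].numpy())  # token 0 is [CLS]
    return np.vstack(chunks)

# Hypothetical labelled examples standing in for a downstream task.
texts = ["a great movie", "a dull movie", "loved it", "hated it"]
labels = [1, 0, 1, 0]
X = cls_embeddings(texts)

# Compress the 768-d [CLS] vectors into a much smaller subspace and train a
# linear probe on the compressed representation.
pca = PCA(n_components=4)   # toy value; a real study would sweep this
X_small = pca.fit_transform(X)
probe = LogisticRegression().fit(X_small, labels)

# Probe a single output dimension on its own, mirroring the observation that
# some individual dimensions are surprisingly competitive by themselves.
dim = 42                    # arbitrary index chosen for illustration
probe_1d = LogisticRegression().fit(X[:, [dim]], labels)
print(probe.score(X_small, labels), probe_1d.score(X[:, [dim]], labels))
```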

Related research

- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (09/26/2019)
  Increasing model size when pretraining natural language representations ...
- BERT Busters: Outlier LayerNorm Dimensions that Disrupt BERT (05/14/2021)
  Multiple studies have shown that BERT is remarkably robust to pruning, y...
- W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training (08/07/2021)
  Motivated by the success of masked language modeling (MLM) in pre-traini...
- MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers (08/03/2020)
  Music annotation has always been one of the critical topics in the field...
- Molecular representation learning with language models and domain-relevant auxiliary tasks (11/26/2020)
  We apply a Transformer architecture, specifically BERT, to learn flexibl...
- Enjoy the Salience: Towards Better Transformer-based Faithful Explanations with Word Salience (08/31/2021)
  Pretrained transformer-based models such as BERT have demonstrated state...
