Improving BERT Fine-tuning with Embedding Normalization

11/10/2019
by Wenxuan Zhou, et al.

Large pre-trained sentence encoders like BERT have started a new chapter in natural language processing. A common practice for applying pre-trained BERT to sequence classification tasks (e.g., classification of sentences or sentence pairs) is to feed the embedding of the [CLS] token (in the last layer) into a task-specific classification layer, and then fine-tune the parameters of BERT and the classifier jointly. In this paper, we conduct a systematic analysis over several sequence classification datasets to examine the embedding values of the [CLS] token before the fine-tuning phase, and identify the biased embedding distribution issue: the embedding values of [CLS] concentrate on a few dimensions and are non-zero centered. Such a biased embedding distribution makes fine-tuning harder to optimize, as gradients of the [CLS] embedding may explode and degrade model performance. We further propose several simple yet effective normalization methods that modify the [CLS] embedding during fine-tuning. Compared with the previous practice, the neural classification model with normalized embeddings shows improvements on several text classification tasks, demonstrating the effectiveness of our method.
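The abstract does not spell out which normalization variants the paper evaluates, so the snippet below is only a minimal PyTorch sketch of the general setup: take the last-layer [CLS] embedding, apply a normalization step, and feed it to a task-specific classification layer that is fine-tuned jointly with BERT. The class name NormalizedClsClassifier, the choice of LayerNorm, and the assumption of a Hugging Face-style encoder returning last_hidden_state are illustrative, not the authors' implementation.

import torch.nn as nn

class NormalizedClsClassifier(nn.Module):
    """[CLS]-embedding classifier with an explicit normalization step.

    Illustrative sketch: LayerNorm stands in for the normalization
    methods proposed in the paper, which the abstract does not enumerate.
    """
    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        # e.g. transformers.BertModel.from_pretrained("bert-base-uncased")
        self.encoder = encoder
        self.norm = nn.LayerNorm(hidden_size)      # re-centers and re-scales the [CLS] values
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0]   # [CLS] token from the last layer
        cls_embedding = self.norm(cls_embedding)           # normalization before the classifier
        return self.classifier(cls_embedding)               # logits for the downstream task

During fine-tuning, the encoder, the normalization layer, and the classification head are trained jointly, matching the common practice described in the abstract.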

Related research

04/23/2020
UHH-LT LT2 at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer Networks for Offensive Language Detection
Fine-tuning of pre-trained transformer networks such as BERT yield state...

04/10/2020
SimpleTran: Transferring Pre-Trained Sentence Embeddings for Low Resource Text Classification
Fine-tuning pre-trained sentence embedding models like BERT has become t...

11/22/2021
Finding the Winning Ticket of BERT for Binary Text Classification via Adaptive Layer Truncation before Fine-tuning
In light of the success of transferring language models into NLP tasks, ...

06/14/2021
Exploiting Sentence-Level Representations for Passage Ranking
Recently, pre-trained contextual models, such as BERT, have shown to per...

10/14/2022
Kernel-Whitening: Overcome Dataset Bias with Isotropic Sentence Embedding
Dataset bias has attracted increasing attention recently for its detrime...

05/06/2019
Anonymized BERT: An Augmentation Approach to the Gendered Pronoun Resolution Challenge
We present our 7th place solution to the Gendered Pronoun Resolution cha...

04/16/2021
Fast, Effective and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders
Pretrained Masked Language Models (MLMs) have revolutionised NLP in rece...
