Knowledge Distillation of Transformer-based Language Models Revisited

06/29/2022
by Chengqiang Lu, et al.

In the past few years, transformer-based pre-trained language models have achieved astounding success in both industry and academia. However, their large model size and high run-time latency are serious impediments to applying them in practice, especially on mobile phones and Internet of Things (IoT) devices. To compress such models, a considerable body of literature has recently grown up around the theme of knowledge distillation (KD). Nevertheless, how KD works in transformer-based models is still unclear. We tease apart the components of KD and propose a unified KD framework. Through this framework, systematic and extensive experiments consuming over 23,000 GPU hours yield a comprehensive analysis from the perspectives of knowledge types, matching strategies, width-depth trade-offs, initialization, model size, etc. Our empirical results shed light on distillation in pre-trained language models and deliver a relatively significant improvement over the previous state of the art (SOTA). Finally, we provide a best-practice guideline for KD in transformer-based models.
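
For context, KD in this setting typically combines a soft-target loss on the teacher's output logits with optional matching of intermediate representations. The sketch below is a minimal, generic illustration of those two common knowledge types in PyTorch; it is not the unified framework proposed in the paper, and the temperature, the alpha weighting, and the `proj` layer bridging a student-teacher width mismatch are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def response_distillation_loss(student_logits, teacher_logits, labels,
                               temperature=2.0, alpha=0.5):
    """Logit-based KD: softened KL to the teacher plus hard-label cross-entropy."""
    # Softened distributions; a higher temperature exposes more of the
    # teacher's "dark knowledge" about non-target classes.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so its gradient magnitude stays comparable
    # to the cross-entropy term.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


def hidden_state_matching_loss(student_hidden, teacher_hidden, proj):
    """Feature-based KD: align a chosen student layer with a chosen teacher layer.

    `proj` (e.g. an nn.Linear) maps the student's hidden size to the
    teacher's; which layers to pair is exactly the kind of matching-strategy
    choice the paper's experiments examine.
    """
    return F.mse_loss(proj(student_hidden), teacher_hidden)
```

In practice, the feature-matching term is usually added to the response-level loss with its own weight, and the choice of teacher layers, loss weights, and temperature is tuned per task.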


Related research

09/13/2021
KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation
The development of over-parameterized pre-trained language models has ma...

06/11/2023
GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model
Currently, the reduction in the parameter scale of large-scale pre-train...

09/18/2023
Heterogeneous Generative Knowledge Distillation with Masked Image Modeling
Small CNN-based models usually require transferring knowledge from a lar...

01/30/2023
Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation
Large-scale pre-trained language models (PLMs) with powerful language mo...

08/03/2021
Linking Common Vulnerabilities and Exposures to the MITRE ATT&CK Framework: A Self-Distillation Approach
Due to the ever-increasing threat of cyber-attacks to critical cyber inf...

04/12/2020
TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER
Deep and large pre-trained language models are the state-of-the-art for ...

12/29/2020
Accelerating Pre-trained Language Models via Calibrated Cascade
Dynamic early exiting aims to accelerate pre-trained language models' (P...
