ERNIE-Tiny: A Progressive Distillation Framework for Pretrained Transformer Compression

06/04/2021 · by Weiyue Su, et al.

Pretrained language models (PLMs) such as BERT adopt a training paradigm that first pretrains the model on general data and then finetunes it on task-specific data, and they have recently achieved great success. However, PLMs are notorious for their enormous number of parameters and are hard to deploy in real-life applications. Knowledge distillation has become a prevailing way to address this problem by transferring knowledge from a large teacher to a much smaller student over a set of data. We argue that the selection of three key components, namely the teacher, the training data, and the learning objective, is crucial to the effectiveness of distillation. We therefore propose a four-stage progressive distillation framework, ERNIE-Tiny, to compress PLMs, which varies these three components gradually from the general level to the task-specific level. Specifically, the first stage, General Distillation, performs distillation with guidance from a pretrained teacher, general data, and the latent distillation loss. Then, General-Enhanced Distillation changes the teacher model from the pretrained teacher to a finetuned one. After that, Task-Adaptive Distillation shifts the training data from general data to task-specific data. Finally, Task-Specific Distillation adds two additional losses, namely the soft-label and hard-label losses, on top of the previous stage. Empirical results demonstrate the effectiveness of our framework and the generalization gain brought by ERNIE-Tiny. In particular, experiments show that a 4-layer ERNIE-Tiny maintains over 98.0% of its teacher's performance, surpassing the state of the art (SOTA) by 1.0% with the same number of parameters. Moreover, ERNIE-Tiny achieves a new compression SOTA on five Chinese NLP tasks, outperforming BERT base by 0.4% accuracy with fewer parameters and 9.4x faster inference speed.
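
To make the staged recipe concrete, the sketch below lays out the four-stage schedule exactly as the abstract describes it: the teacher, the training data, and the active losses each move from the general level to the task-specific level, one component at a time. This is only an illustrative Python sketch; the names (Stage, SCHEDULE, run_progressive_distillation, distill_fn) are placeholders and not the authors' actual implementation or API.

```python
# Minimal sketch of the four-stage progressive distillation schedule
# described in the abstract. All names here are illustrative placeholders.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    teacher: str       # "pretrained" or "finetuned"
    data: str          # "general" or "task-specific"
    losses: List[str]  # distillation losses active in this stage


# The three components (teacher, data, losses) change one at a time,
# moving gradually from general-level to task-specific-level.
SCHEDULE = [
    Stage("General Distillation",          teacher="pretrained", data="general",       losses=["latent"]),
    Stage("General-Enhanced Distillation", teacher="finetuned",  data="general",       losses=["latent"]),
    Stage("Task-Adaptive Distillation",    teacher="finetuned",  data="task-specific", losses=["latent"]),
    Stage("Task-Specific Distillation",    teacher="finetuned",  data="task-specific", losses=["latent", "soft-label", "hard-label"]),
]


def run_progressive_distillation(distill_fn: Callable[[Stage], None]) -> None:
    """Run the stages in order; `distill_fn` trains the student for one stage."""
    for stage in SCHEDULE:
        distill_fn(stage)


if __name__ == "__main__":
    # Dry run: print what each stage would use.
    run_progressive_distillation(
        lambda s: print(f"{s.name}: teacher={s.teacher}, data={s.data}, losses={s.losses}")
    )
```

Note how each stage differs from the previous one in exactly one component, which is the "progressive" aspect the paper's title refers to.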

Related research

04/07/2020 · Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation
Recently, BERT has become an essential ingredient of various NLP deep mo...

06/23/2020 · Distilling Object Detectors with Task Adaptive Regularization
Current state-of-the-art object detectors are at the expense of high com...

05/24/2023 · How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives
Recently, various intermediate layer distillation (ILD) objectives have ...

05/03/2023 · A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training
Modern Natural Language Generation (NLG) models come with massive comput...

09/13/2020 · DistilE: Distiling Knowledge Graph Embeddings for Faster and Cheaper Reasoning
Knowledge Graph Embedding (KGE) is a popular method for KG reasoning and...

02/20/2023 · Progressive Knowledge Distillation: Building Ensembles for Efficient Inference
We study the problem of progressive distillation: Given a large, pre-tra...

09/07/2021 · Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression
Recent studies on compression of pretrained language models (e.g., BERT)...
