BERT-of-Theseus: Compressing BERT by Progressive Module Replacing

02/07/2020
by Canwen Xu, et al.

In this paper, we propose a novel model compression approach that effectively compresses BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds a compact substitute for each of them. We then randomly replace the original modules with their substitutes, training the compact modules to mimic the behavior of the original ones, and progressively increase the probability of replacement throughout training. In this way, our approach enables a deeper level of interaction between the original and compact models and smooths the training process. Compared with previous knowledge distillation approaches for BERT compression, our approach uses only one loss function and one hyper-parameter, freeing human effort from hyper-parameter tuning. Our approach outperforms existing knowledge distillation approaches on the GLUE benchmark, offering a new perspective on model compression.
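As a rough illustration of the idea described above, the sketch below shows how progressive module replacing could be wired up in PyTorch. It is a minimal sketch under stated assumptions: each "module" is a stack of Transformer layers whose input and output shapes match, and the names (TheseusEncoder, linear_replacement_schedule) and schedule constants are illustrative, not taken from the paper's released code.

```python
# Minimal sketch of progressive module replacing (hypothetical names).
import torch
import torch.nn as nn


class TheseusEncoder(nn.Module):
    """Pairs each original (predecessor) module with a compact substitute
    (successor) and randomly swaps them during training."""

    def __init__(self, predecessor_modules, successor_modules):
        super().__init__()
        assert len(predecessor_modules) == len(successor_modules)
        self.predecessors = nn.ModuleList(predecessor_modules)
        self.successors = nn.ModuleList(successor_modules)
        self.replace_prob = 0.0  # probability of using a successor module

        # Only the successors are trained; the original modules stay frozen.
        for p in self.predecessors.parameters():
            p.requires_grad = False

    def forward(self, hidden_states):
        for pred, succ in zip(self.predecessors, self.successors):
            if self.training and torch.rand(1).item() < self.replace_prob:
                hidden_states = succ(hidden_states)
            elif self.training:
                hidden_states = pred(hidden_states)
            else:
                # At inference time only the compact successors are kept.
                hidden_states = succ(hidden_states)
        return hidden_states


def linear_replacement_schedule(step, k=1e-4, b=0.3):
    """Replacement probability p(t) = min(1, k*t + b): starts at b and
    grows linearly to 1, so training ends with the compact model alone."""
    return min(1.0, k * step + b)
```

In a training loop one would set encoder.replace_prob = linear_replacement_schedule(step) before each batch and back-propagate only the ordinary task loss (e.g., cross-entropy on the downstream labels) into the successor modules, which corresponds to the single loss function and single hyper-parameter mentioned in the abstract.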


Related research

Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression (09/07/2021)
Recent studies on compression of pretrained language models (e.g., BERT)...

Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning (08/26/2023)
The use of large transformer-based models such as BERT, GPT, and T5 has ...

LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression (04/08/2020)
BERT is a cutting-edge language representation model pre-trained by a la...

Model Compression with Multi-Task Knowledge Distillation for Web-scale Question Answering System (04/21/2019)
Deep pre-training and fine-tuning models (like BERT, OpenAI GPT) have de...

SpikeBERT: A Language Spikformer Trained with Two-Stage Knowledge Distillation from BERT (08/29/2023)
Spiking neural networks (SNNs) offer a promising avenue to implement dee...

You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient (06/04/2021)
Despite superior performance on various natural language processing task...

Distilling Model Knowledge (10/08/2015)
Top-performing machine learning systems, such as deep neural networks, l...
