Distilling the Knowledge of BERT for Text Generation

11/10/2019
by Yen-Chun Chen, et al.

Large-scale pre-trained language models, such as BERT, have recently achieved great success in a wide range of language understanding tasks. However, how to utilize BERT for text generation remains an open question. In this paper, we present a novel approach to addressing this challenge in a generic sequence-to-sequence (Seq2Seq) setting. We first propose a new task, Conditional Masked Language Modeling (C-MLM), which enables fine-tuning BERT on a target text-generation dataset. The fine-tuned BERT (the teacher) is then used as extra supervision to improve a conventional Seq2Seq model (the student). Because BERT is inherently bidirectional, distilling its knowledge encourages the auto-regressive Seq2Seq model to plan ahead, imposing global sequence-level supervision for more coherent text generation. Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple text generation tasks, including machine translation (MT) and text summarization. Our model also achieves new state-of-the-art results on the IWSLT German-English and English-Vietnamese MT datasets.
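
The training recipe sketched in the abstract (fine-tune BERT with C-MLM, then use its per-position predictions as soft targets for an auto-regressive Seq2Seq student) can be written down compactly. The snippet below is a minimal sketch of such a distillation objective, not the authors' released implementation; the interpolation weight alpha, the temperature T, and the assumption that the teacher's logits for every target position are precomputed (by masking each position in turn) are illustrative choices.

```python
# Minimal sketch of the soft-target distillation objective, assuming
# teacher_logits come from the C-MLM-fine-tuned BERT (one prediction per
# masked target position) and student_logits from a Seq2Seq decoder.
# alpha and T are hypothetical hyperparameters, not values from the paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target_ids, pad_id,
                      alpha=0.5, T=1.0):
    """Blend teacher soft targets with the usual token-level cross-entropy.

    student_logits, teacher_logits: (batch, tgt_len, vocab)
    target_ids: (batch, tgt_len) ground-truth target token ids
    """
    # Ignore padding positions in the distillation term.
    mask = (target_ids != pad_id).float()

    # Soft targets: the teacher's full distribution at each target position.
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    kd = -(teacher_probs * student_log_probs).sum(dim=-1)  # cross-entropy with soft labels
    kd = (kd * mask).sum() / mask.sum()

    # Hard targets: standard maximum-likelihood loss on the reference tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        target_ids.view(-1),
        ignore_index=pad_id,
    )
    return alpha * kd + (1.0 - alpha) * ce
```

Because each target position is predicted by the teacher with the rest of the sequence visible on both sides, the soft targets carry information about future tokens that a left-to-right student never sees during standard teacher forcing, which is the sense in which the distilled signal encourages the student to plan ahead.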

Related research

10/13/2020
Incorporating BERT into Parallel Sequence Decoding with Adapters
While large scale pre-trained language models such as BERT have achieved...

05/17/2021
Stage-wise Fine-tuning for Graph-to-Text Generation
Graph-to-text generation has benefited from pre-trained language models ...

09/10/2020
Modern Methods for Text Generation
Synthetic text generation is challenging and has limited success. Recent...

04/22/2020
Residual Energy-Based Models for Text Generation
Text generation is ubiquitous in many NLP tasks, from summarization, to ...

10/29/2019
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
We present BART, a denoising autoencoder for pretraining sequence-to-seq...

04/22/2020
Keyphrase Prediction With Pre-trained Language Model
Recently, generative methods have been widely used in keyphrase predicti...

05/18/2020
GPT-too: A language-model-first approach for AMR-to-text generation
Abstract Meaning Representations (AMRs) are broad-coverage sentence-leve...
