Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System

10/18/2019
by Linjun Shou et al.

Deep pre-training and fine-tuning models (such as BERT and OpenAI GPT) have demonstrated excellent results on question answering tasks. However, due to the sheer number of model parameters, the inference speed of these models is very slow. How to apply these complex models to real business scenarios therefore becomes a challenging but practical problem. Previous model compression methods usually suffer from information loss during the compression procedure, leading to compressed models that are inferior to the original one. To tackle this challenge, we propose a Two-stage Multi-teacher Knowledge Distillation (TMKD for short) method for web Question Answering systems. We first develop a general Q&A distillation task for student model pre-training, and then fine-tune this pre-trained student model with multi-teacher knowledge distillation on downstream tasks (such as the Web Q&A task and the MNLI, SNLI, and RTE tasks from GLUE), which effectively reduces the overfitting bias of individual teacher models and transfers more general knowledge to the student model. Experimental results show that our method significantly outperforms the baseline methods and even achieves results comparable to the original teacher models, while substantially speeding up model inference.
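To make the fine-tuning stage more concrete, the snippet below is a minimal PyTorch sketch of a multi-teacher distillation objective in the spirit of TMKD: the student is trained on a mix of the gold labels and soft targets averaged over several fine-tuned teachers. This is not the authors' implementation; the simple averaging of teacher outputs, the temperature, and the mixing weight alpha are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a multi-teacher distillation loss.
# The student learns from both gold labels and the averaged soft labels of
# several teachers, which smooths out the bias of any single teacher.

import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          temperature=2.0, alpha=0.5):
    """student_logits: [batch, num_classes] from the compressed student.
    teacher_logits_list: list of [batch, num_classes] tensors, one per teacher.
    labels: [batch] gold labels for the downstream task (e.g. MNLI, SNLI, RTE).
    temperature, alpha: assumed hyperparameters for illustration only.
    """
    # Hard-label loss against the gold annotations.
    ce_loss = F.cross_entropy(student_logits, labels)

    # Soft targets: average the teachers' temperature-scaled distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # KL divergence between the student and the averaged teacher distribution,
    # scaled by T^2 as is conventional in knowledge distillation.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * ce_loss + (1.0 - alpha) * kd_loss
```

In the paper's first stage, a similar soft-label objective is used on a large general Q&A corpus to pre-train the student before this task-specific fine-tuning step.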


