MaskedKD: Efficient Distillation of Vision Transformers with Masked Images

02/21/2023
by Seungwoo Son, et al.

Knowledge distillation is a popular and effective regularization technique for training lightweight models, but it also adds significant overhead to the training cost. The drawback is most pronounced when large-scale models, such as vision transformers (ViTs), are used as teachers. We present MaskedKD, a simple yet effective method for reducing the training cost of ViT distillation. MaskedKD masks a fraction of the image patch tokens fed to the teacher, saving teacher inference cost. The tokens to mask are determined from the last-layer attention scores of the student model, which receives the full image. Without requiring any architectural change to the teacher or sacrificing student performance, MaskedKD dramatically reduces the computation and time required for distilling ViTs. We demonstrate that MaskedKD can save up to 50% of the cost of running inference on the teacher model without any drop in student performance, leading to an approximately 28% reduction in combined teacher and student compute.
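
The abstract's procedure can be summarized as a single training step: the student processes the full image, its last-layer attention ranks the patches, and only the top-ranked patches are forwarded through the teacher to produce the distillation target. Below is a minimal PyTorch-style sketch of that step. It assumes the attention score is read as the student's last-layer [CLS]-to-patch attention (one natural reading of "last layer attention score"), and the `student(images)` and `teacher(images, patch_idx=...)` interfaces, along with the `keep_ratio` and `tau` parameters, are hypothetical placeholders rather than the authors' released code; real ViT implementations would need thin wrappers to expose the attention scores and to accept a patch-index subset.

```python
import torch
import torch.nn.functional as F

def maskedkd_step(student, teacher, images, labels, keep_ratio=0.5, tau=1.0):
    """One MaskedKD training step (sketch, hypothetical model interfaces).

    Assumes `student(images)` returns (logits, cls_attn), where cls_attn is
    the last-layer attention from the [CLS] token to the N patch tokens,
    shape (B, N), and `teacher(images, patch_idx=...)` runs the teacher on
    only the selected patch tokens (gathering the matching positional
    embeddings internally).
    """
    # Student sees the full image.
    s_logits, cls_attn = student(images)               # (B, C), (B, N)

    # Rank patches by the student's [CLS] attention; keep the top fraction.
    num_keep = int(cls_attn.size(1) * keep_ratio)
    keep_idx = cls_attn.topk(num_keep, dim=1).indices  # (B, num_keep)

    # Teacher runs only on the kept patches -- this is where the inference
    # savings come from. No gradients are needed for the teacher.
    with torch.no_grad():
        t_logits = teacher(images, patch_idx=keep_idx)

    # Standard distillation objective: cross-entropy plus temperature-scaled
    # KL divergence to the teacher's (masked-input) predictions.
    ce = F.cross_entropy(s_logits, labels)
    kd = F.kl_div(
        F.log_softmax(s_logits / tau, dim=1),
        F.softmax(t_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau ** 2
    return ce + kd
```

With `keep_ratio=0.5`, the teacher processes roughly half of the patch tokens, which corresponds to the "up to 50% of the teacher inference cost" setting quoted above.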
