Solvable Model for Inheriting the Regularization through Knowledge Distillation

12/01/2020
by Luca Saglietti, et al.

In recent years, the empirical success of transfer learning with neural networks has stimulated an increasing interest in obtaining a theoretical understanding of its core properties. Knowledge distillation, where a smaller neural network is trained using the outputs of a larger neural network, is a particularly interesting case of transfer learning. In the present work, we introduce a statistical physics framework that allows an analytic characterization of the properties of knowledge distillation (KD) in shallow neural networks. Focusing the analysis on a solvable model that exhibits a non-trivial generalization gap, we investigate the effectiveness of KD. We show that, through KD, the regularization properties of the larger teacher model can be inherited by the smaller student, and that the resulting generalization performance is closely linked to, and limited by, the optimality of the teacher. Finally, we analyze the double descent phenomenology that can arise in the considered KD setting.
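As a concrete illustration of the distillation setup described in the abstract, the sketch below trains a small student network on the temperature-softened outputs of a larger, pre-trained teacher. This is a generic KD recipe, not the paper's solvable model: the synthetic data, architectures, temperature T, and mixing weight alpha are illustrative assumptions.

```python
# Minimal knowledge-distillation sketch (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic 2-class data in 20 dimensions (placeholder for a real dataset).
X = torch.randn(512, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).long()

teacher = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 2))
student = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 2))

# 1) Train the larger teacher on the hard labels.
opt_t = torch.optim.Adam(teacher.parameters(), lr=1e-2)
for _ in range(200):
    opt_t.zero_grad()
    F.cross_entropy(teacher(X), y).backward()
    opt_t.step()

# 2) Distill: the student fits a mixture of the teacher's softened
#    probabilities (temperature T) and the ground-truth labels.
T, alpha = 4.0, 0.7
with torch.no_grad():
    soft_targets = F.softmax(teacher(X) / T, dim=1)

opt_s = torch.optim.Adam(student.parameters(), lr=1e-2)
for _ in range(200):
    opt_s.zero_grad()
    logits = student(X)
    kd_loss = F.kl_div(F.log_softmax(logits / T, dim=1), soft_targets,
                       reduction="batchmean") * T * T  # rescale gradients by T^2
    ce_loss = F.cross_entropy(logits, y)
    loss = alpha * kd_loss + (1 - alpha) * ce_loss
    loss.backward()
    opt_s.step()

print("student train accuracy:",
      (student(X).argmax(dim=1) == y).float().mean().item())
```

In this sketch the student sees the teacher's full output distribution rather than only hard labels, which is the channel through which, in the paper's analysis, the teacher's regularization can be inherited by the student.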

Related research

10/18/2022 · On effects of Knowledge Distillation on Transfer Learning
Knowledge distillation is a popular machine learning technique that aims...

04/19/2023 · Knowledge Distillation Under Ideal Joint Classifier Assumption
Knowledge distillation is a powerful technique to compress large neural ...

08/29/2021 · Lipschitz Continuity Guided Knowledge Distillation
Knowledge distillation has become one of the most important model compre...

02/25/2021 · Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation
Knowledge distillation is classically a procedure where a neural network...

10/19/2021 · Adaptive Distillation: Aggregating Knowledge from Multiple Paths for Efficient Distillation
Knowledge Distillation is becoming one of the primary trends among neura...

03/30/2020 · On the Unreasonable Effectiveness of Knowledge Distillation: Analysis in the Kernel Regime
Knowledge distillation (KD), i.e. one classifier being trained on the ou...

07/22/2022 · Hyper-Representations for Pre-Training and Transfer Learning
Learning representations of neural network weights given a model zoo is ...
