Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?

This work investigates the compatibility between label smoothing (LS) and knowledge distillation (KD). Contemporary findings addressing this question take dichotomous standpoints: Müller et al. (2019) and Shen et al. (2021b). Critically, there has been no effort to understand and resolve these contradictory findings, leaving the primal question – to smooth or not to smooth a teacher network? – unanswered. The main contributions of our work are the discovery, analysis and validation of systematic diffusion as the missing concept that is instrumental in understanding and resolving these contradictory findings. This systematic diffusion essentially curtails the benefits of distilling from an LS-trained teacher, rendering KD at increased temperatures ineffective. Our discovery is comprehensively supported by large-scale experiments, analyses and case studies, including image classification, neural machine translation and compact student distillation tasks, spanning multiple datasets and teacher-student architectures. Based on our analysis, we suggest that practitioners use an LS-trained teacher with a low-temperature transfer to achieve high-performance students. Code and models are available at https://keshik6.github.io/revisiting-ls-kd-compatibility/
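
To make the setup concrete, below is a minimal PyTorch sketch (an assumed illustration, not the authors' released implementation) of the two ingredients the abstract refers to: a label-smoothing cross-entropy for training the teacher, and a standard temperature-scaled KD objective for transferring it to a student. The hyperparameters alpha, T and lam are illustrative assumptions; the low-temperature transfer suggested above corresponds to keeping T small, e.g. T = 1.

```python
# Minimal sketch (PyTorch): (i) label-smoothing cross-entropy for the teacher,
# (ii) Hinton-style temperature-scaled KD loss for the student.
# Illustrative only; alpha, T and lam are assumed hyperparameters.
import torch
import torch.nn.functional as F


def label_smoothing_ce(logits, targets, alpha=0.1):
    """Cross-entropy against smoothed targets: the true class keeps (1 - alpha)
    probability mass and alpha is spread uniformly over all K classes.
    (PyTorch >= 1.10 also offers F.cross_entropy(..., label_smoothing=alpha).)"""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, alpha / num_classes)
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - alpha + alpha / num_classes)
    return -(smooth * log_probs).sum(dim=-1).mean()


def kd_loss(student_logits, teacher_logits, targets, T=1.0, lam=0.5):
    """KL divergence between temperature-scaled teacher and student
    distributions, mixed with the usual hard-label cross-entropy.
    A low temperature (e.g. T = 1) is what the recommended
    low-temperature transfer corresponds to."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return lam * soft + (1.0 - lam) * hard


if __name__ == "__main__":
    # Toy sanity check with random logits for a 10-class problem.
    torch.manual_seed(0)
    student, teacher = torch.randn(4, 10), torch.randn(4, 10)
    labels = torch.randint(0, 10, (4,))
    print(label_smoothing_ce(teacher, labels).item())
    print(kd_loss(student, teacher, labels, T=1.0).item())
```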

Related research

09/25/2019 · Revisit Knowledge Distillation: a Teacher-free Framework
Knowledge Distillation (KD) aims to distill the knowledge of a cumbersom...

01/30/2023 · On student-teacher deviations in distillation: does it pay to disobey?
Knowledge distillation has been widely-used to improve the performance o...

04/01/2021 · Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study
This work aims to empirically clarify a recently discovered perspective ...

06/25/2016 · Sequence-Level Knowledge Distillation
Neural machine translation (NMT) offers a novel alternative formulation ...

01/30/2023 · Knowledge Distillation ≈ Label Smoothing: Fact or Fallacy?
Contrary to its original interpretation as a facilitator of knowledge tr...

05/18/2022 · [Re] Distilling Knowledge via Knowledge Review
This effort aims to reproduce the results of experiments and analyze the...
