The Implicit Biases of Stochastic Gradient Descent on Deep Neural Networks with Batch Normalization

by Ziquan Liu, et al.

Deep neural networks with batch normalization (BN-DNNs) are invariant to weight rescaling because of their normalization operations. Nevertheless, weight decay (WD) benefits these weight-scale-invariant networks, an effect often attributed to an increase in the effective learning rate as the weight norms decrease. In this paper, we demonstrate that this explanation is insufficient and investigate the implicit biases of stochastic gradient descent (SGD) on BN-DNNs to provide a theoretical explanation for the efficacy of weight decay. We identify two implicit biases of SGD on BN-DNNs: 1) the weight norms in SGD training remain constant in the continuous-time domain and keep increasing in the discrete-time domain; 2) SGD optimizes weight vectors in fully-connected networks, or convolution kernels in convolutional neural networks, by updating only the components that lie in the input feature span, while leaving the components orthogonal to the input feature span unchanged. Thus, SGD without WD accumulates weight noise orthogonal to the input feature span and cannot eliminate it. Our empirical studies corroborate the hypothesis that weight decay suppresses the weight noise that SGD leaves untouched. Furthermore, we propose weight rescaling (WRS) as a replacement for weight decay that achieves the same regularization effect while avoiding the performance degradation that WD causes with some momentum-based optimizers. Our empirical results on image recognition show that, regardless of optimization method and network architecture, training BN-DNNs with WRS achieves similar or better performance than training with WD. We also show that training with WRS generalizes better than training with WD on other computer vision tasks.
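
The two implicit biases can be checked numerically on a toy problem. The sketch below is not the paper's experimental setup: it assumes a single linear unit followed by batch normalization without affine parameters (and with eps = 0), a squared-error loss, rank-deficient random inputs, and a hypothetical rescaling rule that resets the weight norm to its initial value after every step. The function names `bn_loss_grad` and `run_sgd` are illustrative. Because the BN gradient is orthogonal to the weight vector w, a step of size η gives ||w − ηg||² = ||w||² + η²||g||², so the norm can only grow in discrete time; and because the gradient is a linear combination of the input rows, the component of w orthogonal to the input span is never updated.

```python
import numpy as np

def bn_loss_grad(w, X, y):
    """Gradient w.r.t. w of 0.5 * mean((BN(X @ w) - y) ** 2).

    BN here has no affine parameters and eps = 0 (illustrative assumption).
    """
    z = X @ w
    std = z.std()                      # population std, as used by BN
    zhat = (z - z.mean()) / std
    dzhat = (zhat - y) / len(y)        # dLoss / dzhat
    # Standard batch-norm backward pass from zhat to z.
    dz = (dzhat - dzhat.mean() - zhat * np.mean(dzhat * zhat)) / std
    return X.T @ dz                    # a linear combination of the input rows

def run_sgd(steps=200, lr=0.5, rescale=False, seed=0):
    """Run mini-batch SGD; return the weight norm and the norm of the
    component of w orthogonal to the input span after every step."""
    rng = np.random.default_rng(seed)
    # Rank-4 inputs in R^8, so the input span has a 4-dimensional complement.
    X = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 8))
    y = rng.normal(size=64)
    w = rng.normal(size=8)
    init_norm = np.linalg.norm(w)
    # Projector onto the orthogonal complement of the row space of X.
    P_orth = np.eye(8) - np.linalg.pinv(X) @ X

    norms = [init_norm]
    orth = [np.linalg.norm(P_orth @ w)]
    for _ in range(steps):
        idx = rng.choice(64, size=16, replace=False)
        w = w - lr * bn_loss_grad(w, X[idx], y[idx])
        if rescale:
            # Hypothetical rescaling rule: reset the norm to its initial value.
            w = w * (init_norm / np.linalg.norm(w))
        norms.append(np.linalg.norm(w))
        orth.append(np.linalg.norm(P_orth @ w))
    return np.array(norms), np.array(orth)

if __name__ == "__main__":
    norms, orth = run_sgd(rescale=False)
    print("plain SGD: norm", norms[0], "->", norms[-1], "| orthogonal part", orth[0], "->", orth[-1])
    norms, orth = run_sgd(rescale=True)
    print("rescaled:  norm", norms[0], "->", norms[-1], "| orthogonal part", orth[0], "->", orth[-1])
```

In this toy setting, plain SGD lets the weight norm grow while the orthogonal component stays fixed, whereas rescaling holds the norm constant and therefore shrinks the orthogonal component a little on every step. This mirrors the abstract's argument qualitatively and makes no claim about the paper's reported results.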


