Mean Shift Rejection: Training Deep Neural Networks Without Minibatch Statistics or Normalization

by Brendan Ruff, et al.

Deep convolutional neural networks are known to be unstable when trained at high learning rates unless normalization techniques are employed. Normalizing weights or activations allows higher learning rates, which results in faster convergence and higher test accuracy. Batch normalization relies on minibatch statistics that approximate the dataset statistics, but computing them incurs additional compute and memory costs and creates a communication bottleneck in distributed training. Weight normalization and initialization-only schemes do not achieve comparable test accuracy. We introduce a new understanding of the cause of training instability and provide a technique that is independent of normalization and minibatch statistics. Our approach treats training instability as a spatial common-mode signal, which is suppressed by placing the model on a channel-wise zero-mean isocline that is maintained throughout training. First, we apply channel-wise zero-mean initialization of filter kernels with overall unity kernel magnitude. Then, at each training step, we modify the gradients of spatial kernels by subtracting their weighted channel-wise mean, which maintains the common-mode rejection condition and prevents the onset of mean shift. This technique allows direct training of the test graph, so the training and test models are identical. We also demonstrate that injecting random noise throughout the network during training improves generalization. This is based on the idea that batch normalization, as a side effect, performs deep data augmentation by injecting minibatch noise arising from the weakness of the dataset approximation. Our technique achieves higher accuracy than batch normalization and shows for the first time that minibatches and normalization are unnecessary for state-of-the-art training.
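The two steps described in the abstract, zero-mean initialization and gradient mean subtraction, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that "channel-wise" means each kH x kW spatial slice of a filter, that "unity kernel magnitude" means unit L2 norm per output filter, and that the weighted mean can be stood in for by a plain unweighted mean, since the abstract does not specify the weighting. The function names `zero_mean_unit_init`, `reject_mean_shift`, and `sgd_step` are hypothetical.

```python
import numpy as np


def zero_mean_unit_init(shape, rng):
    """Draw kernels of shape (out_ch, in_ch, kh, kw).

    Assumption: every kh x kw spatial slice is made zero-mean, and each
    output filter is scaled to unit L2 norm. Only meaningful for spatial
    kernels (kh * kw > 1); a 1x1 kernel would collapse to zero.
    """
    w = rng.standard_normal(shape)
    # Remove the mean of each spatial slice (the common-mode component).
    w = w - w.mean(axis=(2, 3), keepdims=True)
    # Rescale each output filter to unit magnitude; scaling keeps means at zero.
    norm = np.sqrt((w ** 2).sum(axis=(1, 2, 3), keepdims=True))
    return w / norm


def reject_mean_shift(grad):
    """Subtract the spatial mean of each gradient slice.

    Assumption: an unweighted mean replaces the weighted mean mentioned
    in the abstract.
    """
    return grad - grad.mean(axis=(2, 3), keepdims=True)


def sgd_step(w, grad, lr):
    """Plain SGD update using the mean-rejected gradient."""
    return w - lr * reject_mean_shift(grad)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = zero_mean_unit_init((8, 3, 3, 3), rng)
    for _ in range(100):
        fake_grad = rng.standard_normal(w.shape)  # stand-in for a real gradient
        w = sgd_step(w, fake_grad, lr=0.05)
    # The largest absolute slice mean stays at floating-point noise level.
    print(np.abs(w.mean(axis=(2, 3))).max())
```

Because both the initial weights and every applied update have zero spatial mean per slice, the weights stay on the zero-mean condition for the whole run without any explicit re-projection. The unit magnitude is only imposed at initialization in this sketch.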




Understanding Batch Normalization

Batch normalization is a ubiquitous deep learning technique that normali...

Deep Control - a simple automatic gain control for memory efficient and high performance training of deep convolutional neural networks

Training a deep convolutional neural net typically starts with a random ...

Convolutional Normalization

As the deep neural networks are being applied to complex tasks, the size...

Separating the Effects of Batch Normalization on CNN Training Speed and Stability Using Classical Adaptive Filter Theory

Batch Normalization (BatchNorm) is commonly used in Convolutional Neural...

Non-Proportional Parametrizations for Stable Hypernetwork Learning

Hypernetworks are neural networks that generate the parameters of anothe...

Deep equilibrium networks are sensitive to initialization statistics

Deep equilibrium networks (DEQs) are a promising way to construct models...

SelfNorm and CrossNorm for Out-of-Distribution Robustness

Normalization techniques are crucial in stabilizing and accelerating the...
