Variance-Preserving Initialization Schemes Improve Deep Network Training: But Which Variance is Preserved?

02/13/2019
by Kyle Luther, et al.

Before training a neural net, a classic rule of thumb is to randomly initialize the weights so that the variance of the preactivation is preserved across all layers. This is traditionally interpreted using the total variance due to randomness in both networks (weights) and samples. Alternatively, one can interpret the rule of thumb as preservation of the sample mean and variance for a fixed network, i.e., preactivation statistics computed over the random sample of training samples. The two interpretations differ little for a shallow net, but the difference is shown to be large for a deep ReLU net by decomposing the total variance into the network-averaged sum of the sample variance and square of the sample mean. We demonstrate that the latter term dominates in the later layers through an analytical calculation in the limit of infinite network width, and numerical simulations for finite width. Our experimental results from training neural nets support the idea that preserving sample statistics can be better than preserving total variance. We discuss the implications for the alternative rule of thumb that a network should be initialized to be at the "edge of chaos."
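The decomposition described above can be checked numerically. The following is a minimal sketch (not the authors' code): it draws several random deep ReLU networks with He-style initialization, pushes a batch of i.i.d. Gaussian inputs through each one, and at every layer records the network-averaged sample variance of the preactivations and the network-averaged square of the sample mean. The width, depth, batch size, number of networks, and Gaussian inputs are arbitrary choices for illustration.

```python
# Sketch of the variance decomposition: total variance over (weights, samples)
# split into E_net[ Var_x(z) ] + E_net[ (Mean_x z)^2 ] for preactivations z.
import numpy as np

rng = np.random.default_rng(0)
width, depth, n_samples, n_nets = 512, 30, 256, 20

avg_sample_var = np.zeros(depth)   # E_net[ sample variance of z ]
avg_sq_mean = np.zeros(depth)      # E_net[ square of sample mean of z ]

for _ in range(n_nets):
    h = rng.standard_normal((n_samples, width))      # illustrative Gaussian inputs
    for layer in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)  # He-style init
        z = h @ W.T                                   # preactivations, (n_samples, width)
        mean_over_samples = z.mean(axis=0)            # per-unit sample mean, fixed network
        var_over_samples = z.var(axis=0)              # per-unit sample variance, fixed network
        avg_sq_mean[layer] += np.mean(mean_over_samples ** 2) / n_nets
        avg_sample_var[layer] += np.mean(var_over_samples) / n_nets
        h = np.maximum(z, 0.0)                        # ReLU

# The total variance (over weights and samples) is approximately the sum of the two terms.
for layer in range(0, depth, 5):
    total = avg_sample_var[layer] + avg_sq_mean[layer]
    print(f"layer {layer:2d}: sample var = {avg_sample_var[layer]:.3f}, "
          f"squared mean = {avg_sq_mean[layer]:.3f}, total ~ {total:.3f}")
```

Under He-style scaling the sum of the two terms (the total variance) should stay roughly constant with depth, while, per the abstract, the squared-sample-mean term is expected to grow and dominate in the later layers of a deep ReLU net.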


