Effect of the initial configuration of weights on the training and function of artificial neural networks

by   R. J. Jesus, et al.

The function and performance of neural networks is largely determined by the evolution of their weights and biases in the process of training, starting from the initial configuration of these parameters to one of the local minima of the loss function. We perform the quantitative statistical characterization of the deviation of the weights of two-hidden-layer ReLU networks of various sizes trained via Stochastic Gradient Descent (SGD) from their initial random configuration. We compare the evolution of the distribution function of this deviation with the evolution of the loss during training. We observed that successful training via SGD leaves the network in the close neighborhood of the initial configuration of its weights. For each initial weight of a link we measured the distribution function of the deviation from this value after training and found how the moments of this distribution and its peak depend on the initial weight. We explored the evolution of these deviations during training and observed an abrupt increase within the overfitting region. This jump occurs simultaneously with a similarly abrupt increase recorded in the evolution of the loss function. Our results suggest that SGD's ability to efficiently find local minima is restricted to the vicinity of the random initial configuration of weights.


page 3

page 5


The Sobolev Regularization Effect of Stochastic Gradient Descent

The multiplicative structure of parameters and input data in the first l...

An Alternative View: When Does SGD Escape Local Minima?

Stochastic gradient descent (SGD) is widely used in machine learning. Al...

Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

Large-batch stochastic gradient descent (SGD) is widely used for trainin...

Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?

Many modern learning tasks involve fitting nonlinear models to data whic...

Shaping the learning landscape in neural networks around wide flat minima

Learning in Deep Neural Networks (DNN) takes place by minimizing a non-c...

Directing Chemotaxis-Based Spatial Self-Organization via Biased, Random Initial Conditions

Inspired by the chemotaxis interaction of living cells, we have develope...

Global Convergence of SGD On Two Layer Neural Nets

In this note we demonstrate provable convergence of SGD to the global mi...

Please sign up or login with your details

Forgot password? Click here to reset