Is SGD a Bayesian sampler? Well, almost

by Chris Mingard, et al.

Overparameterised deep neural networks (DNNs) are highly expressive and so can, in principle, generate almost any function that fits a training dataset with zero error. The vast majority of these functions will perform poorly on unseen data, and yet in practice DNNs often generalise remarkably well. This success suggests that a trained DNN must have a strong inductive bias towards functions with low generalisation error. Here we empirically investigate this inductive bias by calculating, for a range of architectures and datasets, the probability P_SGD(f|S) that an overparameterised DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function f consistent with a training set S. We also use Gaussian processes to estimate the Bayesian posterior probability P_B(f|S) that the DNN expresses f upon random sampling of its parameters, conditioned on S. Our main findings are that P_SGD(f|S) correlates remarkably well with P_B(f|S) and that P_B(f|S) is strongly biased towards low-error, low-complexity functions. These results imply that a strong inductive bias in the parameter-function map (which determines P_B(f|S)), rather than a special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime. While our results suggest that the Bayesian posterior P_B(f|S) is the first-order determinant of P_SGD(f|S), there remain second-order differences that are sensitive to hyperparameter tuning. A function probability picture, based on P_SGD(f|S) and/or P_B(f|S), can shed new light on the way that variations in architecture or in hyperparameter settings, such as batch size, learning rate, and optimiser choice, affect DNN performance.
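To make the quantity P_SGD(f|S) concrete, the following is a minimal sketch of how such a probability could be estimated by repeated training. It is not the authors' code. It assumes a toy setting: a one-hidden-layer ReLU network trained with plain per-example SGD on a 3-bit Boolean task, with a "function" identified by the network's labels on a few held-out inputs. The names `train_sgd` and `estimate_p_sgd`, the network size, and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np
from collections import Counter
from itertools import product


def train_sgd(X, y, X_test, rng, hidden=8, lr=0.1, max_epochs=500):
    """Train a tiny ReLU network with per-example SGD (illustrative only).

    Returns the labels on X_test as a bit string, or None if zero
    training error was not reached within max_epochs.
    """
    n, d = X.shape
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=hidden)
    b2 = 0.0
    for _ in range(max_epochs):
        for i in rng.permutation(n):
            h = np.maximum(X[i] @ W1 + b1, 0.0)
            z = h @ w2 + b2
            p = 1.0 / (1.0 + np.exp(-z))
            g = p - y[i]                # gradient of cross-entropy w.r.t. z
            gh = g * w2 * (h > 0)       # backpropagate through the ReLU
            w2 -= lr * g * h
            b2 -= lr * g
            W1 -= lr * np.outer(X[i], gh)
            b1 -= lr * gh
        preds = (np.maximum(X @ W1 + b1, 0.0) @ w2 + b2) > 0
        if np.array_equal(preds, y.astype(bool)):
            break                       # zero training error: stop
    else:
        return None                     # never fitted S; discard this run
    test = (np.maximum(X_test @ W1 + b1, 0.0) @ w2 + b2) > 0
    return "".join("1" if t else "0" for t in test)


def estimate_p_sgd(X, y, X_test, n_runs=30, seed=0):
    """Estimate P_SGD(f|S) as the empirical frequency of each function f."""
    rng = np.random.default_rng(seed)
    counts = Counter()
    for _ in range(n_runs):
        f = train_sgd(X, y, X_test, rng)
        if f is not None:
            counts[f] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {f: c / total for f, c in counts.most_common()}


# Toy data (an assumption for this sketch): inputs in {-1, +1}^3,
# target is the sign of the first coordinate.
inputs = np.array(list(product([-1.0, 1.0], repeat=3)))
labels = (inputs[:, 0] > 0).astype(float)
X_train, y_train = inputs[:5], labels[:5]
X_test = inputs[5:]

probs = estimate_p_sgd(X_train, y_train, X_test)
for f, p in probs.items():
    print(f, round(p, 3))
```

Each key in `probs` is one function on the held-out inputs, and its value is the fraction of converged runs that produced it. A comparison with P_B(f|S) would need a separate estimate of the Bayesian posterior, which the abstract says is obtained with Gaussian processes and which this sketch does not attempt.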




Do deep neural networks have an inbuilt Occam's razor?

The remarkable performance of overparameterized deep neural networks (DN...

Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum

Momentum plays a crucial role in stochastic gradient-based optimization ...

Understanding training and generalization in deep learning by Fourier analysis

Background: It is still an open research area to theoretically understan...

Convergent Block Coordinate Descent for Training Tikhonov Regularized Deep Neural Networks

By lifting the ReLU function into a higher dimensional space, we develop...

TaxoNN: A Light-Weight Accelerator for Deep Neural Network Training

Emerging intelligent embedded devices rely on Deep Neural Networks (DNNs...

A self consistent theory of Gaussian Processes captures feature learning effects in finite CNNs

Deep neural networks (DNNs) in the infinite width/channel limit have rec...

Minimum norm solutions do not always generalize well for over-parameterized problems

Stochastic gradient descent is the de facto algorithm for training deep ...
