Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit

by Boaz Barak, et al.

There is mounting empirical evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning k-sparse parities of n bits, a canonical family of problems that pose theoretical computational barriers. In this setting, we find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time. In particular, we demonstrate empirically that with standard training, a variety of architectures learn sparse parities with n^{O(k)} examples, with loss (and error) curves abruptly dropping after n^{O(k)} iterations. These positive results nearly match known SQ lower bounds, even without an explicit sparsity-promoting prior. We elucidate the mechanisms of these phenomena with a theoretical analysis: we find that the phase transition in performance is not due to SGD "stumbling in the dark" until it finds the hidden set of features (a natural algorithm which also runs in n^{O(k)} time); instead, we show that SGD gradually amplifies a Fourier gap in the population gradient.
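To make the problem concrete, the following is a minimal sketch of the k-sparse parity task and of the naive "stumbling in the dark" baseline that the abstract contrasts with SGD. It is not the paper's method or code. The function names, the ±1 encoding of bits, and the values of n, k, the sample count, and the hidden index set are all assumptions chosen for illustration.

```python
import itertools
import math
import random


def sample_sparse_parity(n, support, m, rng):
    """Draw m uniform points from {-1,+1}^n, labelled by the product
    (parity) of the coordinates in `support`. Hypothetical helper."""
    xs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]
    ys = [math.prod(x[i] for i in support) for x in xs]
    return xs, ys


def brute_force_support(xs, ys, k):
    """Exhaustive baseline: try every size-k subset of coordinates and
    return the first whose parity agrees with every label, else None.
    The search visits up to C(n, k) = n^{O(k)} candidate subsets."""
    n = len(xs[0])
    for subset in itertools.combinations(range(n), k):
        if all(math.prod(x[i] for i in subset) == y for x, y in zip(xs, ys)):
            return subset
    return None


# Illustrative sizes only (assumed, not taken from the paper).
rng = random.Random(0)
n, k, m = 12, 3, 200
hidden = (1, 4, 9)

xs, ys = sample_sparse_parity(n, hidden, m, rng)
recovered = brute_force_support(xs, ys, k)
num_candidates = math.comb(n, k)  # number of subsets the search may visit
```

The exhaustive search learns nothing until it reaches the correct subset, which is the behaviour the abstract rules out as the explanation for SGD's abrupt drop in loss.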




SGD with large step sizes learns sparse features

We showcase important features of the dynamics of the Stochastic Gradien...

Exact Phase Transitions in Deep Learning

This work reports deep-learning-unique first-order and second-order phas...

The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods

To understand how deep learning works, it is crucial to understand the t...

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Deep learning thrives with large neural networks and large datasets. How...

SGD on Neural Networks Learns Functions of Increasing Complexity

We perform an experimental study of the dynamics of Stochastic Gradient ...

Directional Pruning of Deep Neural Networks

In the light of the fact that the stochastic gradient descent (SGD) ofte...

Unrolling SGD: Understanding Factors Influencing Machine Unlearning

Machine unlearning is the process through which a deployed machine learn...
