Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods

by Avery Ma et al.

Stochastic gradient descent (SGD) and adaptive gradient methods, such as Adam and RMSProp, have been widely used in training deep neural networks. We empirically show that while the difference in standard generalization performance between models trained with these methods is small, models trained with SGD exhibit far greater robustness under input perturbations. Notably, our investigation demonstrates the presence of irrelevant frequencies in natural datasets: altering them does not affect models' generalization performance. However, models trained with adaptive methods are sensitive to such changes, suggesting that their use of irrelevant frequencies leads to solutions that are sensitive to perturbations. To better understand this difference, we study the learning dynamics of gradient descent (GD) and sign gradient descent (signGD) on a synthetic dataset that mirrors natural signals. With a three-dimensional input space, models optimized with GD and signGD have standard risks close to zero but differ in their adversarial risks. Our results show that a linear model's robustness to ℓ_2-norm-bounded changes is inversely proportional to its weight norm: a smaller weight norm implies better robustness. In the deep learning setting, our experiments show that SGD-trained neural networks have smaller Lipschitz constants, which explains their better robustness to input perturbations compared with networks trained using adaptive gradient methods.
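The weight-norm claim for linear models can be illustrated numerically. The sketch below is not the paper's code; it only demonstrates the general fact that for a linear model f(x) = w·x, an ℓ_2-bounded perturbation δ with ‖δ‖_2 ≤ ε can change the output by at most ε‖w‖_2 (attained at δ = ε·w/‖w‖_2), so a smaller weight norm bounds the model's sensitivity. The specific vectors and ε are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)   # three-dimensional input, matching the paper's synthetic setup
eps = 0.1                # l2 budget for the perturbation (arbitrary)

# Two hypothetical weight vectors pointing in the same direction but with
# different norms: same predictions up to scale, very different sensitivity.
for w in (np.array([0.2, 0.1, 0.1]), np.array([2.0, 1.0, 1.0])):
    # Worst-case l2-bounded perturbation aligns with w.
    worst_delta = eps * w / np.linalg.norm(w)
    change = abs(w @ (x + worst_delta) - w @ x)
    # The worst-case output change equals eps * ||w||_2.
    assert np.isclose(change, eps * np.linalg.norm(w))
    print(f"||w||_2 = {np.linalg.norm(w):.2f}, worst-case output change = {change:.3f}")
```

Scaling w up by a factor of 10 scales the worst-case output change by the same factor, which is the inverse-proportionality between robustness and weight norm stated in the abstract.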




