Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization

03/31/2021
by Zeke Xie, et al.

It is well known that stochastic gradient noise (SGN) acts as implicit regularization for deep learning and is essential for both the optimization and the generalization of deep networks. Some works have attempted to artificially simulate SGN by injecting random noise into training. However, such simple injected noise cannot work as well as SGN, which is anisotropic and parameter-dependent. To simulate SGN at low computational cost and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach, a powerful alternative to conventional Momentum in classic optimizers. The PNM method maintains two approximately independent momentum terms, so the magnitude of SGN can be controlled explicitly by adjusting the momentum difference. We theoretically prove a convergence guarantee and a generalization advantage of PNM over Stochastic Gradient Descent (SGD). By incorporating PNM into two conventional optimizers, SGD with Momentum and Adam, our extensive experiments empirically verify the significant advantage of the PNM-based variants over the corresponding conventional Momentum-based optimizers. Code: <https://github.com/zeke-xie/Positive-Negative-Momentum>.
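To make the idea concrete, here is a minimal PyTorch-style sketch of a positive-negative momentum step, assuming a plausible reading of the abstract: two momentum buffers are fed gradients from alternating mini-batches (so they are approximately independent), and the update combines them with a positive and a negative weight. The names `beta1` and `beta0`, the normalization factor, and the exact combination rule are illustrative assumptions, not the authors' exact algorithm; see the paper and linked repository for the real optimizers.

```python
import torch

def pnm_step(param, grad, momenta, step, lr=0.1, beta1=0.9, beta0=1.0):
    """Illustrative PNM-style update for a single parameter tensor.

    momenta: list of two buffers; the buffer with index (step % 2) is updated
    with the current gradient, so each buffer accumulates gradients from a
    disjoint (approximately independent) subset of mini-batches.
    """
    cur, prev = momenta[step % 2], momenta[(step + 1) % 2]
    # Refresh the "current" buffer with the new stochastic gradient.
    cur.mul_(beta1 ** 2).add_(grad, alpha=1 - beta1 ** 2)
    # Positive weight on the fresh buffer, negative weight on the stale one;
    # increasing beta0 increases the injected gradient-noise magnitude.
    noisy_momentum = (1 + beta0) * cur - beta0 * prev
    # Rescale so the update size stays comparable to plain momentum SGD
    # (a hypothetical normalization choice for this sketch).
    noisy_momentum /= ((1 + beta0) ** 2 + beta0 ** 2) ** 0.5
    param.data.add_(noisy_momentum, alpha=-lr)
```

In this sketch, setting `beta0 = 0` recovers an ordinary momentum update, while larger `beta0` amplifies the difference between the two buffers and hence the effective SGN, without touching the learning rate or batch size.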

Related research:
- The Marginal Value of Momentum for Small Learning Rate SGD (07/27/2023)
- On the Hyperparameters in Stochastic Gradient Descent with Momentum (08/09/2021)
- Time-Delay Momentum: A Regularization Perspective on the Convergence and Generalization of Stochastic Momentum for Deep Learning (03/02/2019)
- Towards understanding how momentum improves generalization in deep learning (07/13/2022)
- AdaTerm: Adaptive T-Distribution Estimated Robust Moments towards Noise-Robust Stochastic Gradient Optimizer (01/18/2022)
- Spherical Perspective on Learning with Batch Norm (06/23/2020)
- A Simple Asymmetric Momentum Make SGD Greatest Again (09/05/2023)
