Fixing Weight Decay Regularization in Adam

11/14/2017
by Ilya Loshchilov and Frank Hutter

We note that common implementations of adaptive gradient algorithms, such as Adam, limit the potential benefit of weight decay regularization, because the weights do not decay multiplicatively (as would be expected for standard weight decay) but by an additive constant factor. We propose a simple way to resolve this issue by decoupling weight decay and the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam, and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). We also demonstrate that longer optimization runs require smaller weight decay values for optimal results and introduce a normalized variant of weight decay to reduce this dependence. Finally, we propose a version of Adam with warm restarts (AdamWR) that has strong anytime performance while achieving state-of-the-art results on CIFAR-10 and ImageNet32x32. Our source code is available at https://github.com/loshchil/AdamW-and-SGDW
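To make the distinction concrete, here is a minimal NumPy sketch (not the authors' released code) contrasting the common Adam-with-L2 update, in which the decay term is added to the gradient and therefore rescaled by the adaptive denominator, with a decoupled update in the spirit of the paper, in which the weights are decayed directly. Function names and hyperparameter defaults are illustrative only.

```python
import numpy as np

def adam_l2_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=1e-2):
    """Common Adam + L2: the decay term is folded into the gradient,
    so it passes through the adaptive normalization."""
    g = grad + wd * w                      # L2 penalty added to the loss gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """Decoupled weight decay: the moments see only the loss gradient,
    and the weights are shrunk multiplicatively in a separate term.
    Scaling the decay by lr follows a common implementation convention;
    the paper also allows an additional schedule multiplier."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

# Toy usage: minimize 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 101):
    w, m, v = adamw_step(w, w.copy(), m, v, t)
```

The practical difference is that in the coupled form the decay term is divided by sqrt(v_hat), so parameters with large historical gradient magnitudes are effectively decayed less, whereas the decoupled form shrinks every weight at the same relative rate regardless of its gradient history.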

Related research

- 11/23/2020: Stable Weight Decay Regularization. "Weight decay is a popular regularization technique for training of deep ..."
- 09/30/2022: Adaptive Weight Decay: On The Fly Weight Decay Tuning for Improving Robustness. "We introduce adaptive weight decay, which automatically tunes the hyper-..."
- 06/21/2021: How Do Adam and Training Strategies Help BNNs Optimization? "The best performing Binary Neural Networks (BNNs) are usually attained u..."
- 11/14/2019: Understanding the Disharmony between Weight Normalization Family and Weight Decay: ε-shifted L_2 Regularizer. "The merits of fast convergence and potentially better performance of the..."
- 10/29/2018: Three Mechanisms of Weight Decay Regularization. "Weight decay is one of the standard tricks in the neural network toolbox..."
- 06/12/2021: Go Small and Similar: A Simple Output Decay Brings Better Performance. "Regularization and data augmentation methods have been widely used and b..."
- 04/27/2018: Bound and Conquer: Improving Triangulation by Enforcing Consistency. "We study the accuracy of triangulation in multi-camera systems with resp..."
