On Empirical Comparisons of Optimizers for Deep Learning

by Dami Choi, et al.

Selecting an optimizer is a central step in the contemporary deep learning pipeline. In this paper, we demonstrate the sensitivity of optimizer comparisons to the metaparameter tuning protocol. Our findings suggest that the metaparameter search space may be the single most important factor explaining the rankings obtained by recent empirical comparisons in the literature. In fact, we show that these results can be contradicted when metaparameter search spaces are changed. As tuning effort grows without bound, more general optimizers should never underperform the ones they can approximate (e.g., Adam should never perform worse than momentum), but recent attempts to compare optimizers either assume these inclusion relationships are not practically relevant or restrict the metaparameters in ways that break the inclusions. In our experiments, we find that inclusion relationships between optimizers matter in practice and always predict optimizer comparisons. In particular, we find that the popular adaptive gradient methods never underperform momentum or gradient descent. We also report practical tips around tuning often-ignored metaparameters of adaptive gradient methods and raise concerns about fairly benchmarking optimizers for neural network training.
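The inclusion relationship mentioned in the abstract can be illustrated with a small numerical sketch. The code below is not taken from the paper; it is a toy example under stated assumptions: a two-dimensional quadratic loss, a hand-written Adam without bias correction, and heavy-ball momentum. The function names (`run_momentum`, `run_adam`) and all constants are hypothetical. The idea is that when Adam's epsilon is very large, the denominator is dominated by epsilon, so Adam with step size `alpha` behaves approximately like momentum with learning rate `alpha * (1 - beta1) / eps`.

```python
import numpy as np

# Toy problem (an assumption for illustration): loss = 0.5 * sum(c * theta**2).
CURVATURE = np.array([1.0, 10.0])


def grad(theta):
    """Gradient of the toy quadratic loss."""
    return CURVATURE * theta


def run_momentum(theta0, lr, gamma, steps):
    """Heavy-ball momentum: buf <- gamma * buf + g; theta <- theta - lr * buf."""
    theta = theta0.copy()
    buf = np.zeros_like(theta)
    for _ in range(steps):
        buf = gamma * buf + grad(theta)
        theta = theta - lr * buf
    return theta


def run_adam(theta0, alpha, beta1, beta2, eps, steps):
    """Adam without bias correction (a simplification made for this sketch)."""
    theta = theta0.copy()
    m = np.zeros_like(theta)  # first-moment estimate
    v = np.zeros_like(theta)  # second-moment estimate
    for _ in range(steps):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        theta = theta - alpha * m / (np.sqrt(v) + eps)
    return theta


theta0 = np.array([1.0, -2.0])
lr, gamma, steps = 0.01, 0.9, 50

# Choose a huge epsilon and rescale alpha so that
# alpha * (1 - beta1) / eps equals the momentum learning rate.
big_eps = 1e8
alpha = lr * big_eps / (1.0 - gamma)

theta_momentum = run_momentum(theta0, lr, gamma, steps)
theta_adam = run_adam(theta0, alpha, gamma, 0.999, big_eps, steps)

print("max abs difference:", np.max(np.abs(theta_momentum - theta_adam)))
```

In this sketch, restricting epsilon to a small fixed value would remove the setting in which Adam mimics momentum, which is one way a search space can break the inclusion.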




