Variance Reduction in Deep Learning: More Momentum is All You Need

11/23/2021
by Lionel Tondji, et al.

Variance reduction (VR) techniques have contributed significantly to accelerating learning with massive datasets in the smooth and strongly convex setting (Schmidt et al., 2017; Johnson & Zhang, 2013; Roux et al., 2012). However, such techniques have not yet met the same success in the realm of large-scale deep learning, owing to factors such as data augmentation and regularization methods like dropout (Defazio & Bottou, 2019). This challenge has recently motivated the design of novel variance reduction techniques tailored explicitly for deep learning (Arnold et al., 2019; Ma & Yarats, 2018). This work is an additional step in this direction. In particular, we exploit the ubiquitous clustering structure of rich datasets used in deep learning to design a family of scalable variance-reduced optimization procedures by combining existing optimizers (e.g., SGD+Momentum, Quasi-Hyperbolic Momentum, Implicit Gradient Transport) with a multi-momentum strategy (Yuan et al., 2019). Our proposal leads to faster convergence than the vanilla methods on standard benchmark datasets (e.g., CIFAR and ImageNet), is robust to label noise, and is amenable to distributed optimization. We provide a parallel implementation in JAX.
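
The clustering-based multi-momentum idea from the abstract can be made concrete with a short sketch. The JAX snippet below is a minimal illustration, not the paper's actual algorithm: it keeps one momentum buffer per data cluster, refreshes only the buffer of the cluster the current mini-batch was drawn from, and averages the buffers to form a lower-variance update direction. The names (`init_momenta`, `multi_momentum_sgd`, `cluster_id`) and the averaging rule are assumptions made for illustration.

```python
import jax
import jax.numpy as jnp

def init_momenta(params, num_clusters):
    # One momentum buffer per data cluster: the "multi-momentum" idea.
    return [jax.tree_util.tree_map(jnp.zeros_like, params)
            for _ in range(num_clusters)]

def multi_momentum_sgd(params, momenta, grads, cluster_id, lr=0.1, beta=0.9):
    # Refresh only the buffer of the cluster the mini-batch came from
    # (heavy-ball / SGD+Momentum update on that buffer).
    momenta[cluster_id] = jax.tree_util.tree_map(
        lambda m, g: beta * m + (1.0 - beta) * g,
        momenta[cluster_id], grads)
    # Average the per-cluster momenta to form the search direction;
    # stale buffers from other clusters damp the mini-batch noise.
    direction = jax.tree_util.tree_map(
        lambda *ms: sum(ms) / len(ms), *momenta)
    new_params = jax.tree_util.tree_map(
        lambda p, d: p - lr * d, params, direction)
    return new_params, momenta
```

The same wrapper could host a Quasi-Hyperbolic Momentum or Implicit Gradient Transport rule instead of the heavy-ball refresh above; only the per-buffer update line would change.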

Related research

YellowFin and the Art of Momentum Tuning (06/12/2017)
Hyperparameter tuning is one of the big costs of deep learning. State-of...

Label Noise SGD Provably Prefers Flat Global Minimizers (06/11/2021)
In overparametrized models, the noise in stochastic gradient descent (SG...

Optimal Training of Mean Variance Estimation Neural Networks (02/17/2023)
This paper focusses on the optimal implementation of a Mean Variance Est...

Quasi-hyperbolic momentum and Adam for deep learning (10/16/2018)
Momentum-based acceleration of stochastic gradient descent (SGD) is wide...

Training Structured Neural Networks Through Manifold Identification and Variance Reduction (12/05/2021)
This paper proposes an algorithm (RMDA) for training neural networks (NN...

Variance Reduction for Deep Q-Learning using Stochastic Recursive Gradient (07/25/2020)
Deep Q-learning algorithms often suffer from poor gradient estimations w...

On the Reduction of Variance and Overestimation of Deep Q-Learning (10/14/2019)
The breakthrough of deep Q-Learning on different types of environments r...
