Two-Tailed Averaging: Anytime Adaptive Once-in-a-while Optimal Iterate Averaging for Stochastic Optimization

by   Gábor Melis, et al.

Tail averaging improves on Polyak averaging's non-asymptotic behaviour by excluding a number of leading iterates of stochastic optimization from its calculations. In practice, with a finite number of optimization steps and a learning rate that cannot be annealed to zero, tail averaging can get much closer to a local minimum point of the training loss than either the individual iterates or the Polyak average. However, the number of leading iterates to ignore is an important hyperparameter, and starting averaging too early or too late leads to inefficient use of resources or suboptimal solutions. Setting this hyperparameter to improve generalization is even more difficult, especially in the presence of other hyperparameters and overfitting. Furthermore, before averaging starts, the loss is only weakly informative of the final performance, which makes early stopping unreliable. To alleviate these problems, we propose an anytime variant of tail averaging, that has no hyperparameters and approximates the optimal tail at all optimization steps. Our algorithm is based on two running averages with adaptive lengths bounded in terms of the optimal tail length, one of which achieves approximate optimality with some regularity. Requiring only the additional storage for two sets of weights and periodic evaluation of the loss, the proposed two-tailed averaging algorithm is a practical and widely applicable method for improving stochastic optimization.


page 1

page 2

page 3

page 4


Stochastic Hyperparameter Optimization through Hypernetworks

Machine learning models are often tuned by nesting optimization of model...

Anytime Tail Averaging

Tail averaging consists in averaging the last examples in a stream. Comm...

Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

We study the finite-time behaviour of the popular temporal difference (T...

Beating SGD Saturation with Tail-Averaging and Minibatching

While stochastic gradient descent (SGD) is one of the major workhorses i...

A Dual Accelerated Method for Online Stochastic Distributed Averaging: From Consensus to Decentralized Policy Evaluation

Motivated by decentralized sensing and policy evaluation problems, we co...

Trainable Weight Averaging for Fast Convergence and Better Generalization

Stochastic gradient descent (SGD) and its variants are commonly consider...

Nudge: Stochastically Improving upon FCFS

The First-Come First-Served (FCFS) scheduling policy is the most popular...

Please sign up or login with your details

Forgot password? Click here to reset