Training Tips for the Transformer Model

04/01/2018
by Martin Popel et al.

This article describes our experiments in neural machine translation using the recent Tensor2Tensor framework and the Transformer sequence-to-sequence model (Vaswani et al., 2017). We examine some of the critical parameters that affect the final translation quality, memory usage, training stability and training time, concluding each experiment with a set of recommendations for fellow researchers. In addition to confirming the general mantra "more data and larger models", we address scaling to multiple GPUs and provide practical tips for improved training regarding batch size, learning rate, warmup steps, maximum sentence length and checkpoint averaging. We hope that our observations will allow others to get better results given their particular hardware and data constraints.
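
As a rough illustration of the kind of training knobs discussed in the abstract, the sketch below shows how such hyperparameters might be overridden in Tensor2Tensor by registering a custom hparams set. The function name transformer_big_custom and all numeric values are illustrative assumptions, not the paper's recommended settings; consult the full text for the actual recommendations.

    # A minimal sketch, assuming Tensor2Tensor's registry/hparams API.
    # Values are placeholders, not the authors' recommendations.
    from tensor2tensor.models import transformer
    from tensor2tensor.utils import registry


    @registry.register_hparams
    def transformer_big_custom():
        """Transformer "big" hparams with a few training knobs exposed."""
        hparams = transformer.transformer_big()
        hparams.batch_size = 1500                   # tokens per batch per GPU
        hparams.learning_rate = 0.20                # often rescaled with batch size
        hparams.learning_rate_warmup_steps = 16000  # linear warmup before decay
        hparams.max_length = 70                     # drop longer training sentences
        return hparams

Checkpoint averaging, the remaining tip mentioned above, is handled outside the hparams set; Tensor2Tensor ships a separate utility for it (e.g. avg_checkpoints).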


