DeepNet: Scaling Transformers to 1,000 Layers

03/01/2022
by Hongyu Wang, et al.

In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) that modifies the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.
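As a rough illustration of what the abstract describes, the sketch below shows a DeepNorm-style residual block, LayerNorm(α · x + G(x)), together with the encoder-only scaling constants reported in the paper (α = (2N)^{1/4}, β = (8N)^{-1/4} for an N-layer model), where β scales the Xavier gain of the feed-forward, value-projection, and output-projection weights. The class and function names are illustrative placeholders, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class DeepNormResidual(nn.Module):
    """Sketch of a DeepNorm-style residual: LayerNorm(alpha * x + sublayer(x))."""

    def __init__(self, sublayer: nn.Module, d_model: int, alpha: float):
        super().__init__()
        self.sublayer = sublayer            # attention or feed-forward sublayer G
        self.alpha = alpha                  # residual up-scaling constant
        self.norm = nn.LayerNorm(d_model)   # Post-LN placement, as in DeepNorm

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # DeepNorm: up-weight the identity branch by alpha before the Post-LN.
        return self.norm(self.alpha * x + self.sublayer(x))


def deepnorm_constants(num_layers: int):
    # Encoder-only constants from the paper: alpha = (2N)^(1/4), beta = (8N)^(-1/4).
    alpha = (2 * num_layers) ** 0.25
    beta = (8 * num_layers) ** -0.25
    return alpha, beta


def deepnorm_init_(weight: torch.Tensor, beta: float, scaled: bool = True):
    # Initialization recipe: Xavier gain of beta for the feed-forward, value, and
    # output projections; gain 1 for the query and key projections.
    nn.init.xavier_normal_(weight, gain=beta if scaled else 1.0)
```

For the encoder-decoder setting the paper derives separate constants for the encoder and decoder stacks; only the encoder-only case is sketched above.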

Related research

- 11/08/2019: Why Deep Transformers are Difficult to Converge? From Computation Order to Lipschitz Restricted Parameter Initialization
  The Transformer translation model employs residual connection and layer ...

- 06/01/2022: On Layer Normalizations and Residual Connections in Transformers
  In the perspective of a layer normalization (LN) position, the architect...

- 05/04/2023: BranchNorm: Robustly Scaling Extremely Deep Transformers
  Recently, DeepNorm scales Transformers into extremely deep (i.e., 1000 l...

- 09/28/2020: Deep Transformers with Latent Depth
  The Transformer model has achieved state-of-the-art performance in many ...

- 04/17/2020: Understanding the Difficulty of Training Transformers
  Transformers have been proved effective for many deep learning tasks. Tr...

- 01/12/2021: Of Non-Linearity and Commutativity in BERT
  In this work we provide new insights into the transformer architecture, ...

- 08/29/2019: Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
  The general trend in NLP is towards increasing model capacity and perfor...