Unified Normalization for Accelerating and Stabilizing Transformers

08/02/2022
by   Qiming Yang, et al.

Strong results have made Transformers the prevailing architecture in various natural language and vision tasks. As a default component in Transformers, Layer Normalization (LN) normalizes activations within each token to improve robustness. However, LN requires on-the-fly statistics calculation at inference time, as well as division and square root operations, which makes it inefficient on hardware. Moreover, replacing LN with other hardware-efficient normalization schemes (e.g., Batch Normalization) results in inferior performance, or even collapse in training. We find that this dilemma is caused by abnormal behaviors of activation statistics, including large fluctuations over iterations and extreme outliers across layers. To tackle these issues, we propose Unified Normalization (UN), which speeds up inference by being fused with other linear operations while achieving performance on par with LN. UN boosts performance by calibrating the activation and gradient statistics with a tailored fluctuation smoothing strategy. Meanwhile, an adaptive outlier filtration strategy is applied to avoid collapse in training; its effectiveness is theoretically proven and experimentally verified in this paper. We demonstrate that UN can be an efficient drop-in alternative to LN by conducting extensive experiments on language and vision tasks. Besides, we evaluate the efficiency of our method on GPU: Transformers equipped with UN enjoy about a 31% inference speedup and nearly an 18% reduction in memory. Code is available at https://github.com/hikvision-research/Unified-Normalization.
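To make the efficiency argument concrete, below is a minimal sketch (not the authors' implementation) of why a normalization whose statistics are fixed offline can be fused into an adjacent linear layer at inference time, whereas LN's per-token, on-the-fly statistics cannot. The function name fuse_norm_into_linear and the chosen statistics are hypothetical and for illustration only; UN's exact formulation is described in the paper.

# Minimal sketch, assuming a normalization of the form
#   z = gamma * (x - mean) / sqrt(var + eps) + beta
# with mean/var estimated offline (hypothetical values below),
# followed by a linear layer y = W z + b.
import torch
import torch.nn as nn

def fuse_norm_into_linear(mean, var, gamma, beta, linear, eps=1e-5):
    """Fold the fixed-statistics normalization into the following linear layer."""
    scale = gamma / torch.sqrt(var + eps)          # per-channel scale
    shift = beta - mean * scale                    # per-channel shift
    fused = nn.Linear(linear.in_features, linear.out_features)
    with torch.no_grad():
        fused.weight.copy_(linear.weight * scale)                   # W <- W * diag(scale)
        fused.bias.copy_(linear.weight @ shift + linear.bias)       # b <- W @ shift + b
    return fused

# Usage: the fused layer reproduces normalization + linear in a single matmul.
d, out = 8, 16
x = torch.randn(4, d)
mean, var = torch.zeros(d), torch.ones(d) * 2.0     # pretend offline statistics
gamma, beta = torch.ones(d), torch.zeros(d)
lin = nn.Linear(d, out)
ref = lin(gamma * (x - mean) / torch.sqrt(var + 1e-5) + beta)
fused = fuse_norm_into_linear(mean, var, gamma, beta, lin)
print(torch.allclose(ref, fused(x), atol=1e-5))      # True

Because the normalization collapses into the weights and bias of the linear layer, no division or square root remains at inference time; this is the kind of fusion the abstract refers to, and it is only possible when the statistics do not depend on the incoming token.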

