Packing: Towards 2x NLP BERT Acceleration

06/29/2021
by Matej Kosec, et al.
We find that at sequence length 512, padding tokens represent in excess of 50% of the Wikipedia dataset used for pretraining BERT (Bidirectional Encoder Representations from Transformers). Therefore, by removing all padding, we achieve a 2x speed-up in terms of sequences/sec. To exploit this characteristic of the dataset, we develop and contrast two deterministic packing algorithms. Both algorithms rely on the assumption that sequences are interchangeable, so packing can be performed on the histogram of sequence lengths rather than per sample. This transformation of the problem leads to algorithms which are fast and have linear complexity in dataset size. The shortest-pack-first histogram-packing (SPFHP) algorithm determines the packing order for the Wikipedia dataset of over 16M sequences in 0.02 seconds. The non-negative least-squares histogram-packing (NNLSHP) algorithm converges in 28.4 seconds but produces solutions which are more depth-efficient, managing to get near-optimal packing by combining a maximum of 3 sequences in one sample. Using the dataset with multiple sequences per sample requires additional masking in the attention layer and a modification of the MLM loss function. We demonstrate that both of these changes are straightforward to implement and have relatively little impact on the achievable performance gain on modern hardware. Finally, we pretrain BERT-Large using the packed dataset, demonstrating no loss of convergence and the desired 2x speed-up.
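To make the histogram idea concrete, here is a minimal sketch of a greedy packer that works on length counts rather than on individual sequences. It is a simplified illustration in the spirit of shortest-pack-first packing, not the authors' implementation; the function name `pack_histogram`, its parameters, and the output format (a mapping from a tuple of lengths to the number of packs with that composition) are assumptions made for this example.

```python
from collections import defaultdict


def pack_histogram(histogram, max_len, max_per_pack=3):
    """Greedy packing on a histogram {sequence_length: count}.

    Hypothetical helper for illustration only. Returns a dict mapping a
    pack composition (tuple of sequence lengths) to how many packs use it.
    """
    # open_packs[remaining_space][composition] = number of such packs
    open_packs = defaultdict(lambda: defaultdict(int))
    closed = defaultdict(int)

    # Process lengths from longest to shortest.
    for length in sorted(histogram, reverse=True):
        count = histogram[length]
        while count > 0:
            # Open packs that can still take a sequence of this length.
            fitting = [r for r, packs in open_packs.items()
                       if r >= length and packs]
            if fitting:
                # "Shortest pack first": the pack with the most space left.
                remaining = max(fitting)
                composition, n = next(iter(open_packs[remaining].items()))
                take = min(n, count)
                if take == n:
                    del open_packs[remaining][composition]
                else:
                    open_packs[remaining][composition] = n - take
                new_composition = composition + (length,)
            else:
                # Nothing fits: open new packs for all remaining sequences.
                remaining = max_len
                take = count
                new_composition = (length,)

            count -= take
            new_remaining = remaining - length
            if new_remaining == 0 or len(new_composition) >= max_per_pack:
                closed[new_composition] += take
            else:
                open_packs[new_remaining][new_composition] += take

    # Packs that never filled up are still valid; they just carry padding.
    for packs in open_packs.values():
        for composition, n in packs.items():
            closed[composition] += n
    return dict(closed)


packs = pack_histogram({512: 2, 300: 1, 212: 1, 100: 3}, max_len=512)
```

Because the loop runs over distinct lengths and moves whole groups of identical packs at once, the work depends on the number of histogram bins rather than on the number of sequences, which is the property the abstract relies on for speed. The cap `max_per_pack=3` mirrors the "maximum of 3 sequences in one sample" mentioned for NNLSHP and is used here only as an illustrative default.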
