GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference

05/08/2020
by   Ali Hadi Zadeh, et al.

Attention-based models have demonstrated remarkable success in various natural language understanding tasks. However, efficient execution remains a challenge for these models, which are memory-bound due to their massive number of parameters. We present GOBO, a model quantization technique that compresses the vast majority (typically 99.9%) of the 32-bit floating-point parameters of state-of-the-art BERT models and their variants to 3 bits while maintaining their accuracy. Unlike other quantization methods, GOBO requires neither fine-tuning nor retraining to compensate for the quantization error. We present two practical hardware applications of GOBO. In the first, GOBO reduces memory storage and traffic, and as a result, inference latency and energy consumption. This GOBO memory compression mechanism is plug-in compatible with many architectures; we demonstrate it with the TPU, Eyeriss, and an architecture using Tensor Cores-like units. In the second, we present a co-designed hardware architecture that also reduces computation. Uniquely, the GOBO architecture maintains most of the weights in 3-bit form even during computation, a property that (1) makes the processing elements area efficient, allowing us to pack more compute power per unit area, (2) replaces most multiply-accumulations with additions, and (3) reduces off-chip traffic by amplifying on-chip memory capacity.
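The quantization idea described above can be sketched roughly as follows. This is a minimal illustration under assumed details (a 3-sigma outlier threshold and quantile-based centroids), not the paper's exact algorithm: most weights store only a 3-bit index into a small shared centroid table, while a tiny fraction of outlier weights stays in full precision.

```python
# Illustrative, outlier-aware 3-bit quantization in the spirit of GOBO.
# Thresholds and the centroid choice are assumptions for demonstration.
import numpy as np

def quantize_gobo_like(weights, bits=3, outlier_sigmas=3.0):
    """Split weights into fp32 outliers and 3-bit centroid indices."""
    w = weights.ravel()
    mu, sigma = w.mean(), w.std()
    outlier_mask = np.abs(w - mu) > outlier_sigmas * sigma
    inliers = w[~outlier_mask]
    # Assumption: equal-probability (quantile) centroids over the inliers.
    n_bins = 2 ** bits
    qs = (np.arange(n_bins) + 0.5) / n_bins
    centroids = np.quantile(inliers, qs)
    # Each inlier stores only a 3-bit index into the centroid table.
    idx = np.abs(inliers[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, idx, outlier_mask, w[outlier_mask]

def dict_dot(centroids, idx, x):
    # Because every 3-bit weight is one of 2**bits shared centroids, a dot
    # product can first *add* activations per centroid and then do a single
    # multiply per centroid: most multiply-accumulates become additions.
    sums = np.zeros_like(centroids)
    np.add.at(sums, idx, x)
    return float(sums @ centroids)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)
centroids, idx, outlier_mask, outliers = quantize_gobo_like(w)

# Reconstruct for a quick error check.
recon = np.empty_like(w)
recon[~outlier_mask] = centroids[idx]
recon[outlier_mask] = outliers
print(f"outlier fraction: {outlier_mask.mean():.4f}")
print(f"mean abs error:   {np.abs(recon - w).mean():.5f}")
```

The `dict_dot` helper shows why dictionary-based weights reduce computation: activations are accumulated per centroid index with additions only, leaving just one multiplication per table entry per dot product.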

Related research

03/25/2023 · Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures
  Executing machine learning inference tasks on resource-constrained edge ...

01/21/2022 · APack: Off-Chip, Lossless Data Compression for Efficient Deep Learning Inference
  Data accesses between on- and off-chip memories account for a large frac...

06/20/2022 · nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models
  The recent advance of self-supervised learning associated with the Trans...

09/16/2023 · A Low-Latency FFT-IFFT Cascade Architecture
  This paper addresses the design of a partly-parallel cascaded FFT-IFFT a...

07/27/2023 · Scaling TransNormer to 175 Billion Parameters
  We present TransNormerLLM, the first linear attention-based Large Langua...

05/22/2022 · Wireless On-Chip Communications for Scalable In-memory Hyperdimensional Computing
  Hyperdimensional computing (HDC) is an emerging computing paradigm that ...

02/08/2020 · BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization
  Neural networks have demonstrably achieved state-of-the art accuracy usi...
