End-to-End Neural Network Compression via ℓ_1/ℓ_2 Regularized Latency Surrogates

06/09/2023
by   Anshul Nasery, et al.

Neural network (NN) compression via techniques such as pruning and quantization requires setting compression hyperparameters (e.g., the number of channels to prune or the bitwidths for quantization) for each layer, either manually or via neural architecture search (NAS), which can be computationally expensive. We address this problem with an end-to-end technique that optimizes a model's Floating Point Operations (FLOPs) or its on-device latency via a novel ℓ_1/ℓ_2 latency surrogate. Our algorithm is versatile and can be used with many popular compression methods, including pruning, low-rank factorization, and quantization. Crucially, it is fast and runs in almost the same amount of time as training a single model, a significant training speed-up over standard NAS methods. For BERT compression on GLUE fine-tuning tasks, we achieve a 50% reduction in FLOPs with only a 1% drop in performance. For compressing MobileNetV3 on ImageNet-1K, we achieve a 15% reduction in FLOPs and an 11% reduction in on-device latency without any drop in accuracy, while requiring 3× less training compute than SOTA compression techniques. Finally, for transfer learning on smaller datasets, our technique identifies architectures that are 1.2×-1.4× cheaper than the standard MobileNetV3 and EfficientNet suites at almost the same training cost and accuracy.
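The abstract does not spell out the surrogate's exact form, so the sketch below is only a rough illustration of the general idea: learnable per-channel gates whose squared ℓ_1/ℓ_2 ratio acts as a differentiable stand-in for the number of active channels, and hence for a layer's FLOPs, so that compression hyperparameters are learned jointly with the weights. The class and parameter names (GatedConv2d, flops_surrogate, lam) are hypothetical and not taken from the paper.

import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Convolution with learnable per-output-channel gates (hypothetical helper)."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        # One gate per output channel; gates driven toward zero prune channels.
        self.gates = nn.Parameter(torch.ones(out_ch))

    def forward(self, x):
        return self.conv(x) * self.gates.view(1, -1, 1, 1)

    def flops_surrogate(self, out_h, out_w):
        # (||g||_1 / ||g||_2)^2 equals the number of non-zero entries when the
        # gates are flat, and is a differentiable proxy for it in general.
        g = self.gates
        eff_out_channels = (g.abs().sum() / (g.norm(p=2) + 1e-8)) ** 2
        k = self.conv.kernel_size[0] * self.conv.kernel_size[1]
        flops_per_out_channel = self.conv.in_channels * k * out_h * out_w
        return eff_out_channels * flops_per_out_channel

# The surrogate is added to the task loss, so the per-layer channel counts
# (compression hyperparameters) are optimized in a single training run.
layer = GatedConv2d(16, 32, kernel_size=3, padding=1)
x = torch.randn(8, 16, 28, 28)
out = layer(x)                                   # shape: 8 x 32 x 28 x 28
task_loss = out.pow(2).mean()                    # stand-in for the real task loss
lam = 1e-6                                       # FLOPs/accuracy trade-off weight
loss = task_loss + lam * layer.flops_surrogate(28, 28)
loss.backward()

In this toy setup the regularizer only controls output channels of one layer; the paper's method extends the same end-to-end principle to full networks and to on-device latency rather than raw FLOPs.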

