Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs

by Valentin Radu, et al.

Convolutional Neural Networks (CNNs) are becoming a common presence in many applications and services, due to their superior recognition accuracy. They are increasingly being used on mobile devices, often simply by porting large models designed for the server space, although several model compression techniques have been considered. One model compression technique intended to reduce computations is channel pruning. Mobile and embedded systems now have GPUs, which are ideal for the parallel computations of neural networks and offer a lower energy cost per operation. Specialized libraries perform these neural network computations through highly optimized routines. As we find in our experiments, these libraries are optimized for the most common network shapes, making uninstructed channel pruning inefficient. We evaluate higher-level libraries, which analyze the input characteristics of a convolutional layer and, based on these, produce optimized OpenCL (Arm Compute Library and TVM) and CUDA (cuDNN) code. However, in reality, these characteristics and the subsequent choices intended for optimization can have the opposite effect. We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to a 2x slowdown. On the other hand, we also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3x with cuDNN and above 10x with Arm Compute Library and TVM. Our findings expose the need for hardware-instructed neural network pruning.
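The idea of performance-aware pruning described above can be illustrated with a minimal sketch. It is not the authors' method: the function names `pick_channel_count` and `toy_latency`, the tile size of 16 and the fallback penalty are all assumptions made for illustration. The selection step takes any latency-measuring callable, which in practice would time the convolution on the target device, and picks the channel count that actually runs fastest rather than the smallest one.

```python
from typing import Callable, Iterable


def pick_channel_count(
    candidates: Iterable[int],
    measure_latency: Callable[[int], float],
    min_channels: int,
) -> int:
    """Return the candidate channel count with the lowest latency.

    Ties are broken in favour of keeping more channels, since extra
    channels at equal cost do not hurt inference time.
    """
    allowed = [c for c in candidates if c >= min_channels]
    if not allowed:
        raise ValueError("no candidate satisfies min_channels")
    # Sort key: latency ascending, then channel count descending.
    return min(allowed, key=lambda c: (measure_latency(c), -c))


def toy_latency(channels: int, tile: int = 16) -> float:
    """Hypothetical cost model, NOT measured data.

    Assumes work is padded up to a multiple of `tile`, and that channel
    counts off the tile boundary pay a fixed penalty for a slower
    fallback routine. Real libraries behave differently; replace this
    with on-device timing.
    """
    padded = -(-channels // tile) * tile  # round up to a multiple of tile
    penalty = 0.0 if channels % tile == 0 else 20.0
    return padded + penalty


if __name__ == "__main__":
    # Under the toy model, pruning 128 -> 113 channels is slower than not pruning.
    print(toy_latency(128), toy_latency(113))
    # Searching 100..128 channels picks a count on a tile boundary.
    print(pick_channel_count(range(100, 129), toy_latency, min_channels=100))
```

Under this toy model, trimming a 128-channel layer to 113 channels raises the modelled cost, while the search settles on 112 channels. That mirrors, in spirit only, the abstract's observation that uninstructed pruning can slow a layer down.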




