Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning

by Guyue Huang, et al.

Weight pruning in deep neural networks (DNNs) can reduce storage and computation cost, but it struggles to deliver practical speedups in model inference time. Tensor-cores can significantly boost the throughput of GPUs on dense computation, but exploiting tensor-cores for sparse DNNs is very challenging. Compared to existing CUDA-cores, tensor-cores require higher data reuse and matrix-shaped instruction granularity, both of which are difficult to obtain from sparse DNN kernels. Existing pruning approaches fail to balance the demands of accuracy and efficiency: random sparsity preserves model quality well but prohibits tensor-core acceleration, while highly structured block-wise sparsity can exploit tensor-cores but suffers from severe accuracy loss. In this work, we propose a novel sparse pattern, Shuffled Block-wise sparsity (Shfl-BW), designed to utilize tensor-cores efficiently while minimizing the constraints on the weight structure. Our insight is that row- and column-wise permutation provides abundant flexibility for the weight structure while introducing negligible overhead with our GPU kernel designs. We optimize the GPU kernels for Shfl-BW in linear and convolution layers. Evaluations show that our techniques achieve state-of-the-art speed-accuracy trade-offs on GPUs. For example, with small accuracy loss, we can accelerate the computation-intensive layers of Transformer by 1.81, 4.18 and 1.90 times on NVIDIA V100, T4 and A100 GPUs respectively at 75
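The idea of combining a row permutation with block-wise pruning can be illustrated with a small pure-Python sketch. This is not the authors' pruning algorithm or GPU kernel: the function names, the magnitude-based column scoring, and the choice of a fixed number of kept columns per row group are all assumptions made for illustration. Rows are grouped in permuted order, every row in a group keeps the same columns, and the product then gathers the input once per group, which is the kind of data reuse the abstract says tensor-cores need.

```python
def group_rows(perm, block_rows):
    """Split a row permutation into consecutive groups of block_rows rows."""
    return [perm[i:i + block_rows] for i in range(0, len(perm), block_rows)]


def shuffled_blockwise_mask(weights, perm, block_rows, keep_cols):
    """Build a 0/1 mask in which all rows of a (permuted) group share kept columns.

    Assumption: columns are ranked by summed absolute weight within the group.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    mask = [[0] * n_cols for _ in range(n_rows)]
    for group in group_rows(perm, block_rows):
        score = [sum(abs(weights[r][c]) for r in group) for c in range(n_cols)]
        kept = sorted(range(n_cols), key=lambda c: -score[c])[:keep_cols]
        for r in group:
            for c in kept:
                mask[r][c] = 1
    return mask


def grouped_matvec(weights, mask, perm, block_rows, x):
    """Compute (weights * mask) @ x group by group, reading only kept columns."""
    y = [0] * len(weights)
    for group in group_rows(perm, block_rows):
        kept = [c for c in range(len(x)) if mask[group[0]][c]]
        gathered = [x[c] for c in kept]  # gathered once, reused by every row in the group
        for r in group:
            y[r] = sum(weights[r][c] * xv for c, xv in zip(kept, gathered))
    return y


# Hypothetical 4x4 weight matrix: rows 0 and 2 share a nonzero pattern, as do rows 1 and 3.
W = [[1, 0, 2, 0],
     [0, 3, 0, 4],
     [5, 0, 6, 0],
     [0, 7, 0, 8]]
perm = [0, 2, 1, 3]  # shuffle so that rows with similar patterns become neighbours
mask = shuffled_blockwise_mask(W, perm, block_rows=2, keep_cols=2)
y = grouped_matvec(W, mask, perm, block_rows=2, x=[1, 1, 1, 1])
```

In this toy case the permuted grouping keeps every nonzero weight at 50% sparsity. With the identity ordering, rows 0 and 1 would share a group and their nonzeros would span all four columns, so a two-column budget would have to discard nonzero weights.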




