Im2win: An Efficient Convolution Paradigm on GPU

by Shuai Lu, et al.

Convolution is the most time-consuming operation in deep neural networks, so its performance is critical to the overall performance of the network. The commonly used methods for convolution on GPU are general matrix multiplication (GEMM)-based convolution and direct convolution. GEMM-based convolution relies on the im2col algorithm, which results in a large memory footprint and reduced performance. Direct convolution avoids the large memory footprint, but its performance is not on par with the GEMM-based approach because of discontinuous memory access. This paper proposes a window-order-based convolution paradigm on GPU, called im2win, which not only reduces the memory footprint but also offers continuous memory access, resulting in improved performance. Furthermore, we apply a range of optimization techniques to the convolution CUDA kernel, including shared memory, tiling, micro-kernel, double buffering, and prefetching. We compare our implementation with direct convolution, PyTorch's GEMM-based convolution with cuBLAS, and six cuDNN-based convolution implementations on twelve state-of-the-art DNN benchmarks. The experimental results show that our implementation 1) uses 23.1% less memory and achieves higher TFLOPS than cuBLAS, 2) uses 32.8% less memory and achieves up to 1.8× the TFLOPS of the best-performing convolutions in cuDNN, and 3) achieves up to 155× the TFLOPS of direct convolution. We further perform an ablation study on the applied optimization techniques and find that the micro-kernel has the greatest positive impact on performance.
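To make the memory-footprint contrast between GEMM-based and direct convolution concrete, here is a minimal pure-Python sketch. It is not the paper's CUDA kernel and does not implement im2win; the function names (`out_size`, `im2col`, `conv_gemm`, `conv_direct`) and the single-channel, no-padding setup are assumptions chosen for illustration. It shows that im2col materializes an unrolled buffer larger than the input, whereas direct convolution reads the input in place.

```python
# Illustrative sketch only: single channel, no padding, hypothetical helper names.
# "Convolution" here is the cross-correlation used by DNN frameworks.

def out_size(in_size, k, stride=1):
    """Output extent of a no-padding convolution along one axis."""
    return (in_size - k) // stride + 1

def im2col(x, kh, kw, stride=1):
    """Unroll x (a list of rows) into a (kh*kw) x (oh*ow) matrix of patches."""
    oh = out_size(len(x), kh, stride)
    ow = out_size(len(x[0]), kw, stride)
    cols = [[0] * (oh * ow) for _ in range(kh * kw)]
    for i in range(oh):
        for j in range(ow):
            for u in range(kh):
                for v in range(kw):
                    cols[u * kw + v][i * ow + j] = x[i * stride + u][j * stride + v]
    return cols

def conv_gemm(x, f, stride=1):
    """GEMM-style convolution: flattened filter times the im2col matrix."""
    kh, kw = len(f), len(f[0])
    cols = im2col(x, kh, kw, stride)
    flat = [f[u][v] for u in range(kh) for v in range(kw)]
    return [sum(flat[r] * cols[r][c] for r in range(kh * kw))
            for c in range(len(cols[0]))]

def conv_direct(x, f, stride=1):
    """Direct convolution: no extra buffer, but strided reads across input rows."""
    kh, kw = len(f), len(f[0])
    oh = out_size(len(x), kh, stride)
    ow = out_size(len(x[0]), kw, stride)
    return [sum(f[u][v] * x[i * stride + u][j * stride + v]
                for u in range(kh) for v in range(kw))
            for i in range(oh) for j in range(ow)]

x = [[r * 5 + c for c in range(5)] for r in range(5)]  # 5x5 input, 25 elements
f = [[1] * 3 for _ in range(3)]                        # 3x3 filter of ones

cols = im2col(x, 3, 3)
im2col_elems = len(cols) * len(cols[0])  # 9 rows x 9 columns = 81 elements
input_elems = len(x) * len(x[0])         # 25 elements
same_result = conv_gemm(x, f) == conv_direct(x, f)
```

In this toy case the im2col buffer holds 81 elements for a 25-element input, because overlapping patches are duplicated. Both routines produce the same output, so the trade-off is between extra memory (im2col) and non-contiguous access (direct), which is the gap the abstract says im2win targets.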




Im2win: Memory Efficient Convolution On SIMD Architectures

Convolution is the most expensive operation among neural network operati...

MEC: Memory-efficient Convolution for Deep Neural Network

Convolution is a critical component in modern deep neural networks, thus...

μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching

NVIDIA cuDNN is a low-level library that provides GPU kernels frequently...

Winograd Convolution for DNNs: Beyond linear polinomials

We investigated a wider range of Winograd family convolution algorithms ...

The Indirect Convolution Algorithm

Deep learning frameworks commonly implement convolution operators with G...

I/O Lower Bounds for Auto-tuning of Convolutions in CNNs

Convolution is the most time-consuming part in the computation of convol...

Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions

Convolution is one of the most computationally intensive operations that...
