FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads

11/24/2019
by   Guoping Long, et al.
0

Performance optimization is the art of continuous seeking a harmonious mapping between the application domain and hardware. Recent years have witnessed a surge of deep learning (DL) applications in industry. Conventional wisdom for optimizing such workloads mainly focus on compute intensive ops (GEMM, Convolution, etc). Yet we show in this work, that the performance of memory intensive computations is vital to E2E performance in practical DL models. We propose FusionStitching, a optimization framework capable of fusing memory intensive elementwise, reduction and fine grained GEMM/Batched-GEMM ops, with or without data dependences, into large computation units, then mapping and transforming them into efficient GPU kernels. We formulate the fusion plan optimization as an integer linear programming (ILP) problem, and propose a set of empirical heuristics to reduce the combinatorial search space. In order to map optimized fusion plans to hardware, we propose a technique to effectively compose various groups of computations into a single GPU kernel, by fully leveraging on chip resources like scratchpads or registers. Experimental results on six benchmarks and four industry scale practical models are encouraging. Overall, FusionStitching can reach up to 5.7x speedup compared to Tensorflow baseline, and achieves 1.25x to 1.85x performance speedups compared to current state of the art, with 1.4x on average (geometric mean).

READ FULL TEXT
research
09/23/2020

FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads

We show in this work that memory intensive computations can result in se...
research
11/13/2018

FusionStitching: Deep Fusion and Code Generation for Tensorflow Computations on GPUs

In recent years, there is a surge on machine learning applications in in...
research
07/02/2020

Automatic Horizontal Fusion for GPU Kernels

We present automatic horizontal fusion, a novel optimization technique t...
research
02/12/2019

Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications

GPU computing is becoming increasingly more popular with the proliferati...
research
09/04/2023

Understanding and Optimizing Serverless Workloads in CXL-Enabled Tiered Memory

Recent Serverless workloads tend to be largescaled/CPU-memory intensive,...
research
10/12/2018

ISA Mapper: A Compute and Hardware Agnostic Deep Learning Compiler

Domain specific accelerators present new challenges and opportunities fo...
research
12/04/2021

Understanding the Limits of Conventional Hardware Architectures for Deep-Learning

Deep learning and hardware for it has garnered immense academic and indu...

Please sign up or login with your details

Forgot password? Click here to reset