High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results

08/23/2021
by Navdeep Katel, et al.

This report presents some early results on code generation targeting tensor cores on NVIDIA GPUs using the MLIR compiler infrastructure. The state of the art in high-performance deep learning today is primarily driven by manually optimized, highly tuned libraries. The approach to developing such libraries is often not modular or reusable to the same extent that compiler infrastructure like LLVM is. Manual optimization typically does not use a standard intermediate representation (IR), although the optimizations performed can be encoded as a sequence of transformation steps and customized passes on an IR. Hand tuning may also miss design points that are easily reachable only through automatic code generation. We believe that until the recent introduction of MLIR (Multi-Level Intermediate Representation), IR infrastructure was not geared to tackle the problem of automatic generation of domain-specific libraries in an effective manner. In particular, it was hard to represent and transform compute abstractions at high, middle, and low levels using a single IR. With suitable abstractions in MLIR, we build an experimental lowering pipeline that is able to automatically generate code for matrix-matrix multiplication on NVIDIA GPUs targeting their tensor cores. On a set of problem sizes we evaluated, initial performance results show that we are able to attain performance that is 95-119 respectively on NVIDIA's Ampere microarchitecture-based GeForce RTX 3090. We believe that these results could be used as motivation for further research and development on automatic code and library generation using IR infrastructure for similar specialized accelerators.

research
03/01/2020

High Performance Code Generation in MLIR: An Early Case Study with GEMM

This article is primarily meant to present an early case study on using ...
research
03/14/2019

Stripe: Tensor Compilation via the Nested Polyhedral Model

Hardware architectures and machine learning (ML) libraries evolve rapidl...
research
09/25/2020

Flexible Performant GEMM Kernels on GPUs

General Matrix Multiplication or GEMM kernels take centre place in high ...
research
06/22/2020

Automatic Kernel Generation for Volta Tensor Cores

A commonly occurring computation idiom in neural networks is to perform ...
research
08/01/2023

Leveraging MLIR for Loop Vectorization and GPU Porting of FFT Libraries

FFTc is a Domain-Specific Language (DSL) for designing and generating Fa...
research
10/29/2022

Enabling Data Movement and Computation Pipelining in Deep Learning Compiler

Pipelining between data loading and computation is a critical tensor pro...
research
05/26/2020

Domain-Specific Multi-Level IR Rewriting for GPU

Traditional compilers operate on a single generic intermediate represent...
