GEMMFIP: Unifying GEMM in BLIS

02/16/2023
by Ruqing G. Xu, et al.

Matrix libraries often focus on achieving high performance for problems considered to be either "small" or "large", as these two scenarios tend to respond best to different optimization strategies. We propose a unified technique for implementing matrix operations like general matrix multiplication (GEMM) that can achieve high performance for both small and large problem sizes. The key is to fuse packing – an operation that copies data to a contiguous layout in memory and which is critical for large matrix performance – with the first computational "pass" over that data. This boosts performance across the problem size spectrum. It also simplifies the tuning of general-purpose libraries, because the library no longer needs carefully expressed and parameterized logic to choose between a "small matrix" strategy and a "large matrix" strategy. A prototype implementation of the technique, built with the BLAS-like Library Instantiation Software (BLIS) framework, is described, and its performance on a range of architectures is reported.
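
To make the fusion idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation and does not use the BLIS API; the function name `gemm_fused_pack`, the `n_block` parameter, and the flat row-major packed buffer are all assumptions chosen for illustration. The sketch computes C = A·B by sweeping over column blocks of B. During the first sweep it reads A from its original layout and writes each element into a contiguous buffer as a side effect; every later sweep reads A from that buffer instead.

```python
def gemm_fused_pack(A, B, n_block=2):
    """Toy C = A @ B on lists of lists, packing A during the first pass.

    Hypothetical illustration only: a real library would pack into
    micro-panels and call an optimized microkernel.
    """
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    packed_A = [0.0] * (m * k)  # contiguous copy of A, row-major

    # Each iteration of this loop is one "pass" over all of A.
    for block_index, j0 in enumerate(range(0, n, n_block)):
        j1 = min(j0 + n_block, n)
        first_pass = block_index == 0
        for i in range(m):
            for p in range(k):
                if first_pass:
                    a = A[i][p]              # read from the original layout
                    packed_A[i * k + p] = a  # pack as a side effect of computing
                else:
                    a = packed_A[i * k + p]  # later passes reuse the packed copy
                for j in range(j0, j1):
                    C[i][j] += a * B[p][j]
    return C, packed_A


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
C, packed = gemm_fused_pack(A, B, n_block=2)
print(C)
```

In this sketch, a problem small enough to finish in one pass never reads the packed buffer, so it pays only for the extra stores, while a larger problem gets the contiguous copy for all later passes without a separate packing sweep. That is one way to read the abstract's claim that a single code path can serve both ends of the size spectrum.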

research
05/03/2016

Implementing Strassen's Algorithm with BLIS

We dispel with "street wisdom" regarding the practical implementation of...
research
11/03/2016

Generating Families of Practical Fast Matrix Multiplication Algorithms

Matrix multiplication (GEMM) is a core operation to numerous scientific ...
research
05/15/2023

Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering

The resurgence of machine learning has increased the demand for high-per...
research
09/01/2016

BLISlab: A Sandbox for Optimizing GEMM

Matrix-matrix multiplication is a fundamental operation of great importa...
research
05/08/2019

Performance Engineering for a Tall Skinny Matrix Multiplication Kernel on GPUs

General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS lib...
research
11/15/2017

PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development

This paper describes PlinyCompute, a system for development of high-perf...
research
04/12/2023

MEMA Runtime Framework: Minimizing External Memory Accesses for TinyML on Microcontrollers

We present the MEMA framework for the easy and quick derivation of effic...
