Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance

03/07/2022
by Hiroyuki Ootomo, et al.

Tensor Core is a mixed-precision matrix-matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on the Ampere architecture. Tensor Cores were developed in response to the high demand for dense matrix multiplication in machine learning, but many applications in scientific computing, such as preconditioners for iterative solvers and low-precision Fourier transforms, can also exploit them. To compute a matrix multiplication on Tensor Cores, the input matrices must be converted to half precision, which results in a loss of accuracy. To avoid this, the mantissa bits lost in the conversion can be stored in additional half-precision variables and used to correct the result of the matrix-matrix multiplication. Even with this correction, Tensor Cores yield higher throughput than FP32 SIMT Cores. Nevertheless, the corrective capability of this method alone is limited, and the resulting accuracy does not match that of a matrix multiplication on FP32 SIMT Cores. We address this problem and develop a high-accuracy, high-performance, and low-power matrix-matrix multiplication implementation on Tensor Cores that exactly matches the accuracy of FP32 SIMT Cores while achieving superior throughput. The implementation is based on NVIDIA's CUTLASS. We found that the keys to achieving this accuracy are how the rounding inside Tensor Cores is handled and the probability of underflow during the correction computation. Our implementation achieves 51 TFlop/s for a limited exponent range using FP16 Tensor Cores and 33 TFlop/s for the full exponent range of FP32 using TF32 Tensor Cores on an NVIDIA A100 GPU, both of which exceed the theoretical FP32 SIMT Core peak performance of 19.5 TFlop/s.
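As a rough, minimal sketch of the correction idea described above (using scalar stand-ins instead of matrices, and a hypothetical to_fp16_trunc helper that emulates FP16 conversion by truncating the FP32 mantissa and ignoring FP16's narrower exponent range): each FP32 input is split into a high FP16 part and a scaled low FP16 part that stores the lost mantissa bits, and the product is then reassembled from three partial products. The paper's actual implementation performs these steps with Tensor Core matrix multiplications via CUTLASS rather than scalar arithmetic.

#include <cstdint>
#include <cstring>
#include <cstdio>
#include <cmath>

// Emulate FP32 -> FP16 conversion by zeroing the 13 low mantissa bits
// (FP16 keeps 10 explicit mantissa bits vs. FP32's 23). This truncation
// ignores FP16's exponent range and rounds toward zero -- a
// simplification, but it reproduces the precision loss being corrected.
static float to_fp16_trunc(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFE000u;  // keep sign, exponent, and top 10 mantissa bits
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}

int main() {
    // Scalar stand-ins for single entries of the input matrices A and B.
    float a = 1.234567f, b = 7.654321f;

    // Split each value into a "hi" FP16 part and a "lo" FP16 part holding
    // the mantissa bits lost in the conversion. The residual is scaled by
    // 2^11 so it stays well inside FP16's normal range, reflecting the
    // underflow issue during the correction computation.
    const float scale = 2048.0f;  // 2^11
    float a_hi = to_fp16_trunc(a);
    float a_lo = to_fp16_trunc((a - a_hi) * scale);
    float b_hi = to_fp16_trunc(b);
    float b_lo = to_fp16_trunc((b - b_hi) * scale);

    // On the GPU, each product below would be one Tensor Core matrix
    // multiplication accumulated in FP32; the lo*lo term is omitted
    // because it falls below FP32 precision.
    float uncorrected = a_hi * b_hi;
    float corrected   = a_hi * b_hi + (a_hi * b_lo + a_lo * b_hi) / scale;

    std::printf("exact:      %.9g\n", (double)(a * b));
    std::printf("hi*hi only: %.9g (error %.3g)\n",
                (double)uncorrected, std::fabs((double)(a * b) - uncorrected));
    std::printf("corrected:  %.9g (error %.3g)\n",
                (double)corrected, std::fabs((double)(a * b) - corrected));
}

The 2^11 scaling factor in this sketch reflects that truncation to FP16 precision leaves a residual roughly 2^-11 times the magnitude of the original value; scaling it back up yields a comfortably normal FP16 number, which is one facet of the underflow consideration the abstract highlights.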

