Optimizing High-Performance Linpack for Exascale Accelerated Architectures

04/20/2023
by Noel Chalmers, et al.

We describe the performance optimizations in rocHPL, AMD's open-source implementation of the High-Performance Linpack (HPL) benchmark, which targets accelerated node architectures designed for exascale systems such as the Frontier supercomputer. The implementation uses the node's high-throughput GPU accelerators through highly optimized linear algebra libraries, and uses the entire CPU socket for the latency-sensitive factorization phases. Novel improvements include a multi-threaded approach to the panel factorization phase on the CPU, time-sharing of CPU cores between processes on the node, and several optimizations that hide MPI communication. We present performance results for this implementation on a single node of the Frontier early access cluster at Oak Ridge National Laboratory, along with results for scaling to multiple nodes.
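
For readers unfamiliar with the structure of HPL, the following is a minimal single-process sketch of the blocked LU factorization with partial pivoting that the benchmark is built around. It is not rocHPL code: the function names (`panel_factor`, `trailing_update`, `lu_blocked`, `solve`) and the block size `nb` are illustrative assumptions, and the sketch uses plain Python lists with no GPU, threading, or MPI. It is meant only to show the two phases the abstract distinguishes: the latency-sensitive panel factorization and the throughput-bound update of the trailing matrix.

```python
def panel_factor(a, k, nb, piv):
    """Factor the panel of columns k..k+nb-1 in place, with partial pivoting."""
    n = len(a)
    end = min(k + nb, n)
    for j in range(k, end):
        # Pivot search: largest magnitude in column j, on or below the diagonal.
        p = max(range(j, n), key=lambda r: abs(a[r][j]))
        if a[p][j] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        if p != j:
            a[j], a[p] = a[p], a[j]          # swap whole rows
            piv[j], piv[p] = piv[p], piv[j]
        for i in range(j + 1, n):
            a[i][j] /= a[j][j]               # multiplier, stored in place of L
            for c in range(j + 1, end):      # update only inside the panel
                a[i][c] -= a[i][j] * a[j][c]


def trailing_update(a, k, nb):
    """Apply the factored panel to the columns to its right."""
    n = len(a)
    end = min(k + nb, n)
    # Triangular solve with the unit-lower block L11 to form the U row block.
    for j in range(k, end):
        for i in range(j + 1, end):
            for c in range(end, n):
                a[i][c] -= a[i][j] * a[j][c]
    # Matrix-matrix update of the trailing block: A22 -= L21 * U12.
    for i in range(end, n):
        for j in range(k, end):
            lij = a[i][j]
            for c in range(end, n):
                a[i][c] -= lij * a[j][c]


def lu_blocked(a, nb):
    """Return (LU, piv) where row i of LU came from row piv[i] of a."""
    lu = [row[:] for row in a]
    piv = list(range(len(lu)))
    for k in range(0, len(lu), nb):
        panel_factor(lu, k, nb, piv)
        trailing_update(lu, k, nb)
    return lu, piv


def solve(a, b, nb=2):
    """Solve a x = b using the blocked factorization above."""
    lu, piv = lu_blocked(a, nb)
    n = len(lu)
    x = [b[p] for p in piv]                  # apply the row permutation to b
    for i in range(n):                       # forward substitution (unit L)
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):             # back substitution (U)
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x
```

In this sketch the panel step is a sequence of small dependent operations (pivot search, row swap, scaling), while the trailing update is dominated by one large matrix-matrix product. That contrast is the usual reason for assigning the two phases to different hardware, as the abstract describes for the CPU and GPU.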

research
02/25/2022

HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark

We present hipBone, an open source performance-portable proxy applicatio...
research
11/30/2022

GPU-Accelerated DNS of Compressible Turbulent Flows

This paper explores strategies to transform an existing CPU-based high-p...
research
11/18/2022

PIM-tree: A Skew-resistant Index for Processing-in-Memory

The performance of today's in-memory indexes is bottlenecked by the memo...
research
02/07/2020

Breaking Band: A Breakdown of High-performance Communication

The critical path of internode communication on large-scale systems is c...
research
04/10/2018

Implementing Push-Pull Efficiently in GraphBLAS

We factor Beamer's push-pull, also known as direction-optimized breadth-...
research
05/21/2020

Signal Processing for a Reverse-GPS Wildlife Tracking System: CPU and GPU Implementation Experiences

We present robust high-performance implementations of signal-processing ...
research
07/07/2020

A Task-based Multi-shift QR/QZ Algorithm with Aggressive Early Deflation

The QR algorithm is one of the three phases in the process of computing ...