GPU-aware Communication with UCX in Parallel Programming Models: Charm++, MPI, and Python

by   Jaemin Choi, et al.

As an increasing number of leadership-class systems embrace GPU accelerators in the race towards exascale, efficient communication of GPU data is becoming one of the most critical components of high-performance computing. For developers of parallel programming models, implementing support for GPU-aware communication using native APIs for GPUs such as CUDA can be a daunting task as it requires considerable effort with little guarantee of performance. In this work, we demonstrate the capability of the Unified Communication X (UCX) framework to compose a GPU-aware communication layer that serves multiple parallel programming models of the Charm++ ecosystem: Charm++, Adaptive MPI (AMPI), and Charm4py. We demonstrate the performance impact of our designs with microbenchmarks adapted from the OSU benchmark suite, obtaining improvements in latency of up to 10.2x, 11.7x, and 17.4x in Charm++, AMPI, and Charm4py, respectively. We also observe increases in bandwidth of up to 9.6x in Charm++, 10x in AMPI, and 10.5x in Charm4py. We show the potential impact of our designs on real-world applications by evaluating a proxy application for the Jacobi iterative method, improving the communication performance by up to 12.4x in Charm++, 12.8x in AMPI, and 19.7x in Charm4py.


page 1

page 3

page 4

page 5


gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

GPU-aware collective communication has become a major bottleneck for mod...

Efficient Embedding of MPI Collectives in MXNET DAGs for scaling Deep Learning

Availability of high performance computing infrastructures such as clust...

HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark

We present hipBone, an open source performance-portable proxy applicatio...

Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications

In this paper, we evaluate the portability of the SYCL programming model...

An Empirical Evaluation of Allgatherv on Multi-GPU Systems

Applications for deep learning and big data analytics have compute and m...

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

High performance multi-GPU computing becomes an inevitable trend due to ...

IMEXLBM 1.0: A Proxy Application based on the Lattice Boltzmann Method for solving Computational Fluid Dynamic problems on GPUs

The US Department of Energy launched the Exascale Computing Project (ECP...

Please sign up or login with your details

Forgot password? Click here to reset