Memory Reduction using a Ring Abstraction over GPU RDMA for Distributed Quantum Monte Carlo Solver

by   Weile Wei, et al.

Scientific applications that run on leadership computing facilities often face the challenge of being unable to fit leading science cases onto accelerator devices due to memory constraints (memory-bound applications). In this work, the authors studied one such US Department of Energy mission-critical condensed matter physics application, Dynamical Cluster Approximation (DCA++), and this paper discusses how device memory-bound challenges were successfully reduced by proposing an effective "all-to-all" communication method – a ring communication algorithm. This implementation takes advantage of acceleration on GPUs and remote direct memory access (RDMA) for fast data exchange between GPUs. Additionally, the ring algorithm was optimized with sub-ring communicators and multi-threaded support to further reduce communication overhead and expose more concurrency, respectively. The computation and communication were also analyzed by using the Autonomic Performance Environment for Exascale (APEX) profiling tool, and this paper further discusses the performance trade-off for the ring algorithm implementation. The memory analysis on the ring algorithm shows that the allocation size for the authors' most memory-intensive data structure per GPU is now reduced to 1/p of the original size, where p is the number of GPUs in the ring communicator. The communication analysis suggests that the distributed Quantum Monte Carlo execution time grows linearly as sub-ring size increases, and the cost of messages passing through the network interface connector could be a limiting factor.


page 1

page 2

page 3

page 4


MGPU-TSM: A Multi-GPU System with Truly Shared Memory

The sizes of GPU applications are rapidly growing. They are exhausting t...

Efficient MPI-based Communication for GPU-Accelerated Dask Applications

Dask is a popular parallel and distributed computing framework, which ri...

Non-trivial lower bound for 3-coloring the ring in the quantum LOCAL model

We consider the LOCAL model of distributed computing, where in a single ...

Noisy Tensor Ring approximation for computing gradients of Variational Quantum Eigensolver for Combinatorial Optimization

Variational Quantum algorithms, especially Quantum Approximate Optimizat...

Performance Analysis of a Quantum Monte Carlo Application on Multiple Hardware Architectures Using the HPX Runtime

This paper describes how we successfully used the HPX programming model ...

Monitoring Collective Communication Among GPUs

Communication among devices in multi-GPU systems plays an important role...

RPC Considered Harmful: Fast Distributed Deep Learning on RDMA

Deep learning emerges as an important new resource-intensive workload an...

Please sign up or login with your details

Forgot password? Click here to reset