Hoplite: Efficient Collective Communication for Task-Based Distributed Systems

by Siyuan Zhuang, et al.

Collective communication systems such as MPI offer high-performance group communication primitives at the cost of application flexibility. Today, an increasing number of distributed applications (e.g., reinforcement learning) require flexibility in expressing dynamic and asynchronous communication patterns. To accommodate these applications, task-based distributed computing frameworks (e.g., Ray, Dask, Hydro) have become popular because they allow applications to specify communication dynamically by invoking tasks, or functions, at runtime. This design makes efficient collective communication challenging because (1) the group of communicating processes is chosen at runtime, and (2) processes may not all be ready at the same time. We design and implement Hoplite, a communication layer for task-based distributed systems that achieves high-performance collective communication without compromising application flexibility. The key idea of Hoplite is to use distributed protocols to compute a data transfer schedule on the fly. This enables the same optimizations used in traditional collective communication, but for applications that specify the communication incrementally. We show that Hoplite achieves performance comparable to that of a traditional collective communication library, MPICH. We port a popular distributed computing framework, Ray, on top of Hoplite. We show that Hoplite can speed up asynchronous parameter server and distributed reinforcement learning workloads, which are difficult to execute efficiently with traditional collective communication, by up to 8.1x and 3.9x, respectively.
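
To make the idea of computing a transfer schedule on the fly more concrete, the following is a minimal toy sketch and not Hoplite's actual protocol. It assumes a single centralized scheduler, a fixed per-transfer time, and that a node can serve only one transfer at a time. The function name `schedule_broadcast` and its parameters are hypothetical. Each receiver that becomes ready is assigned to whichever node already holds the object and is free earliest, so receivers that finish early become senders for later ones.

```python
def schedule_broadcast(source, ready_times, transfer_time=1.0):
    """Greedy toy scheduler for broadcasting one object.

    source: name of the node that initially holds the object.
    ready_times: dict mapping receiver name -> time at which it asks for the object.
    Returns a list of (sender, receiver, start, finish) tuples.
    """
    # node -> earliest time at which it holds the object and is free to send
    available_at = {source: 0.0}
    schedule = []
    # Handle receivers in the order they become ready (ties broken by name).
    for receiver, ready in sorted(ready_times.items(), key=lambda kv: (kv[1], kv[0])):
        # Pick the holder that is free earliest (ties broken by name).
        sender = min(available_at, key=lambda n: (available_at[n], n))
        start = max(ready, available_at[sender])
        finish = start + transfer_time
        schedule.append((sender, receiver, start, finish))
        # The sender is busy until the transfer ends; the receiver holds
        # the object from that point on and can serve later requests.
        available_at[sender] = finish
        available_at[receiver] = finish
    return schedule


if __name__ == "__main__":
    # Hypothetical arrival times: n4 shows up much later than the others.
    plan = schedule_broadcast("n0", {"n1": 0.0, "n2": 0.0, "n3": 0.5, "n4": 3.0})
    for sender, receiver, start, finish in plan:
        print(f"{sender} -> {receiver}: start={start}, finish={finish}")
```

In this sketch the set of receivers does not need to be known up front: the schedule grows one entry at a time as requests arrive, which mirrors points (1) and (2) above only in spirit. A real system would have to make these decisions with distributed protocols rather than a single loop.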



SparCML: High-Performance Sparse Communication for Machine Learning

One of the main drivers behind the rapid recent advances in machine lear...

Monitoring Collective Communication Among GPUs

Communication among devices in multi-GPU systems plays an important role...

Ray: A Distributed Framework for Emerging AI Applications

The next generation of AI applications will continuously interact with t...

RLgraph: Flexible Computation Graphs for Deep Reinforcement Learning

Reinforcement learning (RL) tasks are challenging to implement, execute ...

Programming the Interactions of Collective Adaptive Systems by Relying on Attribute-based Communication

Collective adaptive systems are new emerging computational systems consi...

OpTree: An Efficient Algorithm for All-gather Operation in Optical Interconnect Systems

All-gather collective communication is one of the most important communi...

AI, Native Supercomputing and The Revival of Moore's Law

Based on Alan Turing's proposition on AI and computing machinery, which ...
