GC3: An Optimizing Compiler for GPU Collective Communication

01/27/2022
by Meghan Cowan, et al.

Machine learning models made up of millions or billions of parameters are often trained and served on large multi-GPU systems. As models grow in size and execute on more GPUs, the collective communications used in these applications become a bottleneck. Custom collective algorithms optimized for both particular network topologies and application-specific communication patterns can alleviate this bottleneck and thus help these applications scale. This paper introduces GC3, a system designed to make GPU communication programmable. GC3 provides a data-oriented domain-specific language for writing custom collective communication algorithms and an optimizing compiler for lowering them to an executable form, which can be executed efficiently and flexibly in an interpreter-based runtime. We used GC3 to write novel collective implementations for AllReduce and AllToAll that are up to 48% faster than optimized vendor implementations. We also demonstrate how directly implementing an application-specific collective called AllToNext in GC3 results in a 14.5x speedup over the baseline.
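For readers unfamiliar with the collectives named in the abstract, the following is a minimal sketch of what AllReduce and AllToAll compute, simulated on plain Python lists. It is not GC3's DSL or any vendor API: the function names `ring_allreduce` and `all_to_all` are hypothetical, each "GPU" is just a list, and the ring schedule is one well-known textbook algorithm chosen for illustration, assuming every rank's buffer has exactly one element per rank.

```python
def ring_allreduce(buffers):
    """Simulate a ring AllReduce (sum) over n ranks.

    Assumption: buffers[r] is rank r's input and has exactly n entries,
    one "chunk" per rank. Returns each rank's buffer after the collective.
    """
    n = len(buffers)
    chunks = [list(b) for b in buffers]  # copy so the inputs stay untouched

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        values = [chunks[r][idx] for r, idx in sends]  # snapshot: sends are simultaneous
        for (r, idx), val in zip(sends, values):
            chunks[(r + 1) % n][idx] += val

    # Phase 2, all-gather: rank r now owns the fully reduced chunk (r + 1) mod n
    # and forwards finished chunks around the ring, overwriting stale copies.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        values = [chunks[r][idx] for r, idx in sends]
        for (r, idx), val in zip(sends, values):
            chunks[(r + 1) % n][idx] = val

    return chunks


def all_to_all(buffers):
    """Simulate AllToAll: rank r ends up with chunk r from every rank."""
    n = len(buffers)
    return [[buffers[src][dst] for src in range(n)] for dst in range(n)]


if __name__ == "__main__":
    inputs = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_allreduce(inputs))
    print(all_to_all(inputs))
```

The ring schedule uses 2(n - 1) steps and only neighbour-to-neighbour transfers, which suits ring-like topologies; other topologies favour different schedules, which is the kind of choice a programmable collective system is meant to expose.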

research
08/19/2020

Synthesizing Optimal Collective Algorithms

Collective communication algorithms are an important component of distri...
research
11/08/2021

Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL

Large ML models and datasets have necessitated the use of multi-GPU syst...
research
08/09/2023

gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

GPU-aware collective communication has become a major bottleneck for mod...
research
10/11/2017

Synkhronos: a Multi-GPU Theano Extension for Data Parallelism

We present Synkhronos, an extension to Theano for multi-GPU computations...
research
11/17/2019

Optimizing Ordered Graph Algorithms with GraphIt

Many graph problems can be solved using ordered parallel graph algorithm...
research
06/28/2023

Collective-Optimized FFTs

This paper measures the impact of the various alltoallv methods. Results...
research
10/20/2021

Monitoring Collective Communication Among GPUs

Communication among devices in multi-GPU systems plays an important role...
