A Doubly-pipelined, Dual-root Reduction-to-all Algorithm and Implementation

09/26/2021
by Jesper Larsson Träff, et al.

We discuss a simple, binary tree-based algorithm for the collective allreduce (reduction-to-all, MPI_Allreduce) operation for parallel systems consisting of p suitably interconnected processors. The algorithm can be doubly pipelined to exploit bidirectional (telephone-like) communication capabilities of the communication system. In order to make the algorithm more symmetric, the processors are organized into two rooted trees with communication between the two roots. For each pipeline block, each non-leaf processor takes three communication steps, consisting of receiving from and sending to the two children, and sending to and receiving from the parent. In a round-based, uniform, linear-cost communication model in which simultaneously sending and receiving n data elements takes time α + βn for system-dependent constants α (communication start-up latency) and β (time per element), the time for the allreduce operation on vectors of m elements is O(log p + √(m log p)) + 3βm by suitable choice of the pipeline block size. We compare the performance of an implementation in MPI to similar reduce-followed-by-broadcast algorithms, and to the native MPI_Allreduce collective, on a modern, small 36×32 processor cluster. With proper choice of the number of pipeline blocks, it is possible to achieve better performance than pipelined algorithms that do not exploit bidirectional communication.
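The stated bound can be sanity-checked against the linear cost model. The sketch below is a rough model only, not the paper's implementation: it assumes a total round count of about 3(m/b + log2 p) for block size b, consistent with the three communication steps per pipeline block described above, and charges α + βb per round. Balancing the α·m/b and β·b·log p terms gives the block size b* = √(αm / (β log2 p)) that yields the O(log p + √(m log p)) + 3βm total; the constants alpha and beta here are illustrative placeholders.

```python
import math

def modeled_time(m, p, b, alpha=1e-6, beta=1e-9):
    """Modeled allreduce time for vectors of m elements on p processors
    with pipeline block size b, assuming ~3*(m/b + log2 p) rounds that
    each cost alpha + beta*b (simultaneous send/receive of b elements)."""
    rounds = 3 * (math.ceil(m / b) + math.ceil(math.log2(p)))
    return rounds * (alpha + beta * b)

def optimal_block(m, p, alpha=1e-6, beta=1e-9):
    """Block size balancing 3*alpha*m/b against 3*beta*b*log2(p),
    i.e. b* = sqrt(alpha*m / (beta*log2 p)); plugging b* back in gives
    the O(log p + sqrt(m log p)) + 3*beta*m form quoted above."""
    return math.sqrt(alpha * m / (beta * math.log2(p)))

if __name__ == "__main__":
    m, p = 10**6, 36 * 32  # vector length and processor count as in the paper's cluster
    b_star = optimal_block(m, p)
    print(f"b* = {b_star:.0f}, modeled time = {modeled_time(m, p, b_star):.3e} s")
```

In this model, very small blocks are latency-bound (the 3αm/b term dominates) and a single unpipelined block is bandwidth-bound along the tree depth (the 3βb log p term dominates); b* sits between the two extremes.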


Related research

05/20/2022: (Poly)Logarithmic Time Construction of Round-optimal n-Block Broadcast Schedules for Broadcast and Irregular Allgather in MPI
We give a fast(er), communication-free, parallel construction of optimal...

11/23/2017: On Optimal Trees for Irregular Gather and Scatter Collectives
This paper studies the complexity of finding cost-optimal communication ...

08/27/2020: k-ported vs. k-lane Broadcast, Scatter, and Alltoall Algorithms
In k-ported message-passing systems, a processor can simultaneously rece...

05/10/2022: All-to-All Encode in Synchronous Systems
We define all-to-all encode, a collective communication operation servin...

10/29/2019: Decomposing Collectives for Exploiting Multi-lane Communication
Many modern, high-performance systems increase the cumulated node-bandwi...

07/13/2022: Four-splitting based coarse-grained multicomputer parallel algorithm for the optimal binary search tree problem
This paper presents a parallel solution based on the coarse-grained mult...

04/20/2020: A Generalization of the Allreduce Operation
Allreduce is one of the most frequently used MPI collective operations, ...
