Improving the Performance and Resilience of MPI Parallel Jobs with Topology and Fault-Aware Process Placement

12/29/2020
by Ioannis Vardas, et al.

HPC systems keep growing in size to meet the ever-increasing demand for performance and computational resources. Apart from increased performance, large-scale systems face two challenges that hinder further growth: energy efficiency and resiliency. At the same time, applications seeking increased performance rely on advanced parallelism to exploit system resources, which increases pressure on system interconnects. At large system scales, increased communication locality can benefit both application performance and energy consumption. To this end, several studies focus on deriving a mapping of an application's processes to system nodes that reduces communication cost. A common approach is to express both the application's communication patterns and the system architecture as graphs and then solve the corresponding mapping problem. Apart from communication cost, the completion time of a job can also be affected by node failures, which may cause job abortions and require job restarts. In this paper, we address the problem of assigning processes to system resources with the goal of reducing communication cost while also taking node failures into account. The proposed approach is integrated into the Slurm resource manager. Evaluation results show that, in scenarios where few nodes have a low outage probability, the proposed process placement approach achieves a notable decrease in the completion time of batches of MPI jobs. Compared to the default process placement approach in Slurm, the reduction is 18.9 two different MPI applications.
