Improving the Performance and Resilience of MPI Parallel Jobs with Topology and Fault-Aware Process Placement

by Ioannis Vardas, et al.

HPC systems keep growing in size to meet the ever-increasing demand for performance and computational resources. Apart from increased performance, large-scale systems face two challenges that hinder further growth: energy efficiency and resiliency. At the same time, applications seeking increased performance rely on advanced parallelism to exploit system resources, which increases the pressure on system interconnects. At large system scales, increased communication locality can be beneficial in terms of both application performance and energy consumption. To this end, several studies focus on deriving a mapping of an application's processes to system nodes that reduces communication cost. A common approach is to express both the application's communication patterns and the system architecture as graphs and then solve the corresponding mapping problem. Apart from communication cost, the completion time of a job can also be affected by node failures. Node failures may result in job abortions, requiring job restarts. In this paper, we address the problem of assigning processes to system resources with the goal of reducing communication cost while also taking into account node failures. The proposed approach is integrated into the Slurm resource manager. Evaluation results show that, in scenarios where few nodes have a low outage probability, the proposed process placement approach achieves a notable decrease in the completion time of batches of MPI jobs. Compared to the default process placement approach in Slurm, the reduction is 18.9% [...] for two different MPI applications.
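To make the graph-mapping idea in the abstract concrete, the following is a minimal sketch and not the paper's algorithm or its Slurm integration. It assumes one process per node, a symmetric traffic matrix between ranks, a hop-count matrix between nodes, and a per-node outage probability. All names (`comm_cost`, `survival_probability`, `best_placement`) and the scoring rule, which divides communication cost by the probability that no selected node fails, are hypothetical choices made for illustration.

```python
from itertools import permutations


def comm_cost(traffic, hops, placement):
    """Sum of traffic[i][j] * hops between the nodes hosting ranks i and j."""
    n = len(traffic)
    return sum(
        traffic[i][j] * hops[placement[i]][placement[j]]
        for i in range(n)
        for j in range(i + 1, n)
    )


def survival_probability(outage, placement):
    """Probability that none of the nodes used by the placement fails,
    assuming independent node failures (an assumption of this sketch)."""
    prob = 1.0
    for node in set(placement):
        prob *= 1.0 - outage[node]
    return prob


def best_placement(traffic, hops, outage, fault_aware=True):
    """Exhaustively search one-process-per-node placements.

    Brute force is only viable for tiny inputs; it stands in for the
    heuristic graph-mapping solvers used in practice.
    """
    n_ranks, n_nodes = len(traffic), len(hops)

    def score(placement):
        cost = comm_cost(traffic, hops, placement)
        if not fault_aware:
            return cost
        alive = survival_probability(outage, placement)
        # Illustrative penalty: inflate cost when a failure is likely.
        return float("inf") if alive == 0.0 else cost / alive

    return min(permutations(range(n_nodes), n_ranks), key=score)


if __name__ == "__main__":
    # Three ranks: 0 and 1 exchange heavy traffic, 1 and 2 light traffic.
    traffic = [[0, 10, 0], [10, 0, 1], [0, 1, 0]]
    # Four nodes on a line, so the hop count is the index distance.
    hops = [[abs(a - b) for b in range(4)] for a in range(4)]
    # Node 1 is unreliable; the others are assumed never to fail.
    outage = [0.0, 0.5, 0.0, 0.0]

    print(best_placement(traffic, hops, outage, fault_aware=False))
    print(best_placement(traffic, hops, outage, fault_aware=True))
```

In this toy setup the fault-oblivious search is free to use the unreliable node because it only minimizes hop-weighted traffic, while the fault-aware search accepts a slightly higher communication cost to avoid it. This is the trade-off the abstract describes, shown here at a much smaller scale and with a simplified cost model.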


