How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures

by Carlos Pachajoa, et al.

We study algorithmic approaches for recovering from the failure of several compute nodes in the parallel preconditioned conjugate gradient (PCG) solver on large-scale parallel computers. In particular, we analyze and extend an exact state reconstruction (ESR) approach, which is based on a method proposed by Chen (2011). In the ESR approach, the solver keeps redundant information from previous search directions, so that the solver state can be fully reconstructed if a node fails unexpectedly. ESR does not require checkpointing or external storage for saving dynamic solver data and has low overhead compared to the failure-free situation. In this paper, we improve the fault tolerance of the PCG algorithm based on the ESR approach. In particular, we support recovery from simultaneous or overlapping failures of several nodes for general sparsity patterns of the system matrix, which cannot be handled by Chen's method. For this purpose, we refine the strategy for storing redundant information across nodes. We analyze and implement our new method and perform numerical experiments with large sparse matrices from real-world applications on 128 nodes of the Vienna Scientific Cluster (VSC). For recovering from three simultaneous node failures we observe average runtime overheads as low as 2.8%. The overhead of the improved resilience depends on the sparsity pattern of the system matrix.
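The abstract does not spell out how redundant copies are placed, so the following is only a minimal sketch of the counting argument behind multi-failure resilience, not the authors' strategy. It assumes a hypothetical ring placement in which each node's block of search-direction data is copied to the next `phi` nodes, and it checks by brute force whether every failed node's block still exists on some surviving node. The names `backup_nodes`, `survives`, `tolerates_all`, and `phi` are illustrative assumptions.

```python
from itertools import combinations


def backup_nodes(owner, num_nodes, phi):
    """Assumed ring placement: the phi nodes after `owner` hold a copy of its block."""
    return [(owner + k) % num_nodes for k in range(1, phi + 1)]


def survives(num_nodes, phi, failed):
    """True if each failed node's block is still held by at least one live node."""
    failed = set(failed)
    return all(
        any(b not in failed for b in backup_nodes(node, num_nodes, phi))
        for node in failed
    )


def tolerates_all(num_nodes, phi, num_failures):
    """Check every possible set of `num_failures` simultaneous node failures."""
    return all(
        survives(num_nodes, phi, failed)
        for failed in combinations(range(num_nodes), num_failures)
    )


if __name__ == "__main__":
    print(tolerates_all(8, 3, 3))  # True: three copies cover any three failures
    print(tolerates_all(8, 2, 3))  # False: e.g. nodes 0, 1, 2 failing loses block 0
```

Under this assumed placement, `phi` copies on distinct nodes are enough to cover any `phi` simultaneous failures, because a failed node can have at most `phi - 1` of its backup holders fail with it. In a real solver the cost of keeping those copies would depend on how much of the data is already communicated between nodes, which is consistent with the abstract's remark that the overhead depends on the sparsity pattern of the system matrix.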




