NVM-ESR: Using Non-Volatile Memory in Exact State Reconstruction of Preconditioned Conjugate Gradient

by Yehonatan Fridman, et al.

HPC systems are a critical resource for scientific research and advanced industries. The demand for computational power and memory keeps increasing, ushering in the exascale era, in which supercomputers are designed to provide enormous computing power to meet these needs. These complex supercomputers consist of many compute nodes and are consequently expected to experience frequent faults and crashes. Exact state reconstruction (ESR) has been proposed as a mechanism to alleviate the impact of frequent failures on long-running computations. ESR has shown great potential in the context of iterative linear algebra solvers, a key building block in numerous scientific applications. Recent supercomputer designs feature the emerging non-volatile memory (NVM) technology. For example, the exascale Aurora supercomputer is planned to integrate Intel Optane DCPMM. This work investigates how NVM can be used to improve ESR so that it can scale to future exascale systems such as Aurora and provide enhanced resilience. We propose the non-volatile memory ESR (NVM-ESR) mechanism. NVM-ESR demonstrates how NVM can be utilized in supercomputers to enable efficient recovery from faults while incurring significantly smaller memory-footprint and time overheads than ESR. We focus on the preconditioned conjugate gradient (PCG) iterative solver, which was also studied in prior ESR research, because it is employed by the representative HPCG scientific benchmark.




Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights

In recent years, the increasing complexity in scientific simulations and...

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

Reliability is a serious concern for future extreme-scale high-performan...

Evaluation of Intel Memory Drive Technology Performance for Scientific Applications

In this paper, we present benchmark data for Intel Memory Drive Technolo...

Image Gradient Decomposition for Parallel and Memory-Efficient Ptychographic Reconstruction

Ptychography is a popular microscopic imaging modality for many scientif...

FlipTracker: Understanding Natural Error Resilience in HPC Applications

As high-performance computing systems scale in size and computational po...

Exploiting Inter-Operation Data Reuse in Scientific Applications using GOGETA

HPC applications are critical in various scientific domains ranging from...

Adaptive control in rollforward recovery for extreme scale multigrid

With the increasing number of compute components, failures in future exa...
