Improving Performance of Iterative Methods by Lossy Checkponting

04/30/2018
by   DingDingwen Tao, et al.
0

Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks in parallel, they have to checkpoint the dynamic variables periodically in case of unavoidable fail-stop errors, requiring fast I/O systems and large storage space. To this end, significantly reducing the checkpointing overhead is critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and derive theoretically an upper bound for the extra number of iterations caused by the distortion of data in lossy checkpoints, in order to guarantee the performance improvement under the lossy checkpointing scheme. (3) We analyze the impact of lossy checkpointing (i.e., extra number of iterations caused by lossy checkpointing files) for multiple types of iterative methods. (4)We evaluate the lossy checkpointing scheme with optimal checkpointing intervals on a high-performance computing environment with 2,048 cores, using a well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that our optimized lossy checkpointing scheme can significantly reduce the fault tolerance overhead for iterative methods by 23 lossless-compressed checkpointing, in the presence of system failures.

READ FULL TEXT
research
10/17/2018

Fault Tolerance in Iterative-Convergent Machine Learning

Machine learning (ML) training algorithms often possess an inherent self...
research
03/04/2020

Stability Analysis of Inline ZFP Compression for Floating-Point Data in Iterative Methods

Currently, the dominating constraint in many high performance computing ...
research
04/02/2021

FT-BLAS: A High Performance BLAS Implementation With Online Fault Tolerance

Basic Linear Algebra Subprograms (BLAS) is a core library in scientific ...
research
02/28/2023

i2LQR: Iterative LQR for Iterative Tasks in Dynamic Environments

This work introduces a novel control strategy called Iterative Linear Qu...
research
05/16/2018

FT-LADS: Fault-Tolerant Object-Logging based Big Data Transfer System using Layout-Aware Data Scheduling

Layout-Aware Data Scheduler (LADS) data transfer tool, identifies and ad...
research
09/02/2019

Algorithm-Based Fault Tolerance for Parallel Stencil Computations

The increase in HPC systems size and complexity, together with increasin...

Please sign up or login with your details

Forgot password? Click here to reset