PARIS: Predicting Application Resilience Using Machine Learning

by Luanzheng Guo, et al.

Extreme-scale scientific applications can become more vulnerable to soft errors (transient faults) as high-performance computing systems increase in scale. The common practice for evaluating an application's resilience to faults is random fault injection, a method that can be highly time-consuming. While resilience prediction modeling has recently been proposed to predict application resilience faster than fault injection, it can only predict a single class of fault manifestation (silent data corruption, SDC), and there is no evidence that it works on previously unseen programs, which greatly limits its reusability. We present PARIS, a resilience prediction method that uses machine learning to address the problems of existing prediction methods. Using carefully selected features and a machine learning model, our method predicts three classes of fault manifestation (success, SDC, and interruption), as opposed to the single class covered by current resilience prediction modeling. The generality of our approach allows us to make predictions on new, i.e., previously unseen, applications, which gives our model wide applicability. Our evaluation on 125 programs shows that PARIS provides high prediction accuracy, 82 [...] predicting the rate of success and interruption, respectively, while the state-of-the-art resilience prediction model cannot predict them. When predicting the rate of SDC, PARIS provides much better accuracy than the state-of-the-art (38 [...] than the traditional method (random fault injection).
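To make the prediction targets concrete, the sketch below shows how the three rates named in the abstract (success, SDC, and interruption) could be computed from the outcomes of random fault-injection trials, the baseline that PARIS aims to replace. It is a minimal illustration, not the authors' tooling: the function name `manifestation_rates` and the trial data are hypothetical.

```python
from collections import Counter

# The three fault manifestation classes named in the abstract.
CLASSES = ("success", "SDC", "interruption")


def manifestation_rates(outcomes):
    """Return the fraction of fault-injection trials that ended in each class.

    `outcomes` is a list of class labels, one per trial (hypothetical format).
    """
    if not outcomes:
        raise ValueError("need at least one trial outcome")
    counts = Counter(outcomes)
    unknown = set(counts) - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = len(outcomes)
    return {cls: counts[cls] / total for cls in CLASSES}


# Made-up trial outcomes, for illustration only (not data from the paper).
trials = ["success"] * 6 + ["SDC"] * 3 + ["interruption"] * 1
rates = manifestation_rates(trials)
print(rates)
```

A prediction model in the spirit of the abstract would take program features as input and output these three rates directly, avoiding the cost of running many injection trials.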



MOARD: Modeling Application Resilience to Transient Faults on Data Objects

Understanding application resilience (or error tolerance) in the presenc...

TensorFI: A Flexible Fault Injection Framework for TensorFlow Applications

As machine learning (ML) has seen increasing adoption in safety-critical...

Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications

The persistently growing resilience concerns of large-scale computing sy...

Performance and Power Modeling and Prediction Using MuMMI and Ten Machine Learning Methods

In this paper, we use modeling and prediction tool MuMMI (Multiple Metri...

Model-based Reinforcement Learning for Service Mesh Fault Resiliency in a Web Application-level

Microservice-based architectures enable different aspects of web applica...

Machine Learning Data Suitability and Performance Testing Using Fault Injection Testing Framework

Creating resilient machine learning (ML) systems has become necessary to...

Estimating Silent Data Corruption Rates Using a Two-Level Model

High-performance and safety-critical system architects must accurately e...
