An Adaptive Resilience Testing Framework for Microservice Systems

by Tianyi Yang, et al.

Resilience testing, which measures a system's ability to minimize service degradation caused by unexpected failures, is crucial for microservice systems. Current practice relies on manually defining resilience testing rules for each microservice system. Because the business logic of microservices is so diverse, no one-size-fits-all set of rules exists. As the number and dynamism of microservices and failures grow, manual configuration runs into scalability and adaptivity issues. To overcome these two issues, we empirically compare the impacts of common failures in resilient and unresilient deployments of a benchmark microservice system. Our study demonstrates that the resilient deployment can block the propagation of degradation from system performance metrics (e.g., memory usage) to business metrics (e.g., response latency). In this paper, we propose AVERT, the first AdaptiVE Resilience Testing framework for microservice systems. AVERT first injects failures into microservices and collects the available monitoring metrics. It then ranks all monitoring metrics according to their contributions to the overall service degradation caused by the injected failures. Lastly, AVERT produces a resilience index based on how much the degradation in system performance metrics propagates to the degradation in business metrics. The higher the degradation propagation, the lower the resilience of the microservice system. We evaluate AVERT on two open-source benchmark microservice systems. The experimental results show that AVERT can accurately and efficiently test the resilience of microservice systems.
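
The three steps named in the abstract (collect metrics under injected failures, rank metrics by their contribution to degradation, and derive a resilience index from how far degradation propagates) can be illustrated with a toy sketch. This is not AVERT's actual algorithm, which the abstract does not specify. The relative-change degradation measure, the clipped ratio used as the propagation score, the function names, and the sample numbers are all assumptions chosen only to make the idea concrete.

```python
from statistics import mean


def degradation(normal, faulty):
    """Relative change of each metric's mean between a fault-free run
    and a failure-injected run (an assumed, simplistic measure)."""
    out = {}
    for name, samples in normal.items():
        base = mean(samples)
        out[name] = abs(mean(faulty[name]) - base) / abs(base) if base else 0.0
    return out


def rank_metrics(deg):
    """Metric names ordered from most to least degraded."""
    return sorted(deg, key=deg.get, reverse=True)


def resilience_index(deg, performance, business):
    """Assumed index in [0, 1]: 1 means no degradation reached the
    business metrics, 0 means it propagated fully."""
    perf = mean(deg[m] for m in performance)
    biz = mean(deg[m] for m in business)
    if perf == 0:
        return 1.0  # nothing degraded, so nothing could propagate
    propagation = min(biz / perf, 1.0)
    return 1.0 - propagation


# Hypothetical monitoring samples, invented for illustration only.
normal = {"memory_usage": [100, 100], "cpu_usage": [50, 50],
          "response_latency": [200, 200]}
resilient = {"memory_usage": [180, 180], "cpu_usage": [60, 60],
             "response_latency": [210, 210]}
unresilient = {"memory_usage": [180, 180], "cpu_usage": [60, 60],
               "response_latency": [400, 400]}

PERFORMANCE = ["memory_usage", "cpu_usage"]
BUSINESS = ["response_latency"]

deg_resilient = degradation(normal, resilient)
deg_unresilient = degradation(normal, unresilient)

ranking = rank_metrics(deg_resilient)
index_resilient = resilience_index(deg_resilient, PERFORMANCE, BUSINESS)
index_unresilient = resilience_index(deg_unresilient, PERFORMANCE, BUSINESS)
```

In this toy data both deployments suffer the same memory and CPU degradation, but only the unresilient one lets it reach response latency, so it receives the lower index. That matches the abstract's statement that higher degradation propagation means lower resilience.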


