MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems

by Dewei Liu, et al.

Availability issues in industrial microservice systems (e.g., a drop in successfully placed orders or processed transactions) directly affect the running of the business. These issues are usually caused by various types of service anomalies that propagate along service dependencies. Accurate and highly efficient root cause localization is thus a critical challenge for large-scale industrial microservice systems. Existing approaches use analysis techniques based on the service dependency graph to locate root causes automatically. However, these approaches are limited by their inaccurate detection of service anomalies and their inefficient traversal of the service dependency graph. In this paper, we propose a highly efficient root cause localization approach for availability issues of microservice systems, called MicroHECL. Based on a dynamically constructed service call graph, MicroHECL analyzes possible anomaly propagation chains and ranks candidate root causes using correlation analysis. We combine machine learning and statistical methods and design customized models for detecting different types of service anomalies (i.e., performance, reliability, and traffic anomalies). To improve efficiency, we adopt a pruning strategy that eliminates irrelevant service calls during anomaly propagation chain analysis. Experimental studies show that MicroHECL significantly outperforms two state-of-the-art baseline approaches in terms of both accuracy and efficiency. MicroHECL has been used in Alibaba and achieves a top-3 hit ratio of 68%, with root cause localization time reduced from 30 minutes to 5 minutes.
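The pipeline described in the abstract (walk a service call graph along anomalous calls, prune irrelevant calls, rank the remaining candidates by correlation) can be illustrated with a small sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the function names `pearson` and `localize`, the 0.7 pruning threshold, the use of Pearson correlation for both pruning and ranking, the choice to follow only caller-to-callee edges, and the toy metric series. The anomaly detectors are replaced by a precomputed set of flagged services.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def localize(calls, metrics, anomalous, entry, business_metric,
             prune_threshold=0.7, top_k=3):
    """Return up to top_k candidate root causes reachable from `entry`.

    calls:           dict mapping a service to the services it calls
    metrics:         dict mapping a service to its quality-metric series
    anomalous:       set of services flagged by some (assumed) anomaly detector
    business_metric: series of the affected business indicator
    """
    candidates = []
    visited = {entry}
    stack = [entry]
    while stack:
        svc = stack.pop()
        extended = False
        for callee in calls.get(svc, []):
            # Only follow calls into services that were flagged as anomalous.
            if callee not in anomalous:
                continue
            # Pruning (assumed rule): skip calls whose metrics do not move together.
            if abs(pearson(metrics[svc], metrics[callee])) < prune_threshold:
                continue
            extended = True
            if callee not in visited:
                visited.add(callee)
                stack.append(callee)
        # A service where the chain cannot be extended is a candidate root cause.
        if not extended:
            candidates.append(svc)
    # Rank candidates by how strongly they correlate with the business metric.
    ranked = sorted(candidates,
                    key=lambda s: abs(pearson(metrics[s], business_metric)),
                    reverse=True)
    return ranked[:top_k]


# Toy data (hypothetical): gateway -> orders -> db degrade together; search is noisy.
calls = {"gateway": ["orders", "search"], "orders": ["db"], "search": []}
metrics = {
    "gateway": [1, 1, 1, 5, 6, 7],
    "orders": [1, 1, 1, 4, 6, 8],
    "db": [2, 2, 2, 9, 11, 14],
    "search": [3, 1, 2, 1, 3, 2],
}
anomalous = {"gateway", "orders", "db", "search"}
business_metric = [100, 100, 100, 60, 50, 40]

result = localize(calls, metrics, anomalous, "gateway", business_metric)
print(result)
```

In this toy run the call from `gateway` to `search` is pruned because the two series are nearly uncorrelated, so the walk only follows `gateway`, `orders`, `db` and stops at `db`. A fuller treatment would pick the traversal direction per anomaly type and use the dedicated detection models the abstract mentions, which this sketch does not attempt.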


