RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk

by Marcus Kalander, et al.

Failures and anomalies in large-scale software systems are unavoidable. When an issue is detected, operators need to identify its location quickly and correctly to facilitate a swift repair. In this work, we consider the problem of identifying the root cause set that best explains an anomaly in multi-dimensional time series with categorical attributes. The huge search space is the main challenge: even for a small number of attributes and small value sets, the number of theoretical combinations is too large to brute force. Previous approaches have thus focused on reducing the search space, but they all suffer from at least one of the following issues: requiring extensive manual parameter tuning, being too slow and thus impractical, or being incapable of finding more complex root causes. We propose RiskLoc to solve the problem of multi-dimensional root cause localization. RiskLoc applies a 2-way partitioning scheme and assigns element weights that increase linearly with the distance from the partitioning point. Each element is assigned a risk score that integrates two factors: 1) its weighted proportion within the abnormal partition, and 2) the relative change in the deviation score adjusted for the ripple effect property. Extensive experiments on multiple datasets verify the effectiveness and efficiency of RiskLoc, and for a comprehensive evaluation, we introduce three synthetically generated datasets that complement existing datasets. We demonstrate that RiskLoc consistently outperforms state-of-the-art baselines, especially in more challenging root cause scenarios, with gains in F1-score up to 57
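To make the search-space claim and the first risk factor concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on assumptions that the abstract does not state: an element fixes each attribute to one value or leaves it as a wildcard, any non-empty set of elements counts as a candidate root cause set, leaf elements with a deviation score below the cutoff form the abnormal partition, and a leaf's weight is the absolute distance of its score from the cutoff. All function names are hypothetical.

```python
from math import prod


def count_elements(value_counts):
    """Count attribute combinations (elements).

    ASSUMPTION: each attribute is either fixed to one of its values or
    left as a wildcard; the all-wildcard combination is excluded.
    """
    return prod(n + 1 for n in value_counts) - 1


def count_candidate_sets(value_counts):
    """Count non-empty sets of elements (ASSUMED candidate root cause sets)."""
    return 2 ** count_elements(value_counts) - 1


def partition_and_weight(scores, cutoff):
    """Split leaf elements at `cutoff` and weight them linearly.

    ASSUMPTION: scores below the cutoff are abnormal, and the weight is
    the absolute distance from the cutoff (the partitioning point).
    Returns one (is_abnormal, weight) pair per score.
    """
    return [(s < cutoff, abs(s - cutoff)) for s in scores]


def weighted_abnormal_proportion(scores, cutoff):
    """Share of an element's total leaf weight lying in the abnormal partition."""
    parts = partition_and_weight(scores, cutoff)
    total = sum(weight for _, weight in parts)
    abnormal = sum(weight for is_abnormal, weight in parts if is_abnormal)
    return abnormal / total if total else 0.0


if __name__ == "__main__":
    # Four attributes with three values each already give 255 elements,
    # so the number of candidate sets is 2**255 - 1.
    print(count_elements([3, 3, 3, 3]))
    print(weighted_abnormal_proportion([-1.5, -0.5, 0.5], cutoff=-0.5))
```

Under these counting assumptions, even four attributes with three values each yield far more candidate sets than can be enumerated, which is why pruning is needed. The second factor of the risk score (the ripple-effect-adjusted change in deviation score) is omitted here because the abstract does not give enough detail to sketch it.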


Generic and Robust Root Cause Localization for Multi-Dimensional Data in Online Service Systems

Localizing root causes for multi-dimensional data is critical to ensure ...

CMMD: Cross-Metric Multi-Dimensional Root Cause Analysis

In large-scale online services, crucial metrics, a.k.a., key performance...

MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems

Availability issues of industrial microservice systems (e.g., drop of su...

Root Cause Identification for Collective Anomalies in Time Series given an Acyclic Summary Causal Graph with Loops

This paper presents an approach for identifying the root causes of colle...

BALANCE: Bayesian Linear Attribution for Root Cause Localization

Root Cause Analysis (RCA) plays an indispensable role in distributed dat...

Constructing Large-Scale Real-World Benchmark Datasets for AIOps

Recently, AIOps (Artificial Intelligence for IT Operations) has been wel...

Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data

The complexity and dynamism of microservices pose significant challenges...
