Improving Problem Identification via Automated Log Clustering using Dimensionality Reduction

09/07/2020
by Carl Martin Rosenberg, et al.

Goal: We consider the problem of automatically grouping logs of runs that failed for the same underlying reasons, so that they can be treated more effectively, and investigate the following questions: (1) Does an approach developed to identify problems in system logs generalize to identifying problems in continuous deployment logs? (2) How does dimensionality reduction affect the quality of automated log clustering? (3) How does the criterion used for merging clusters in the clustering algorithm affect clustering quality?

Method: We replicate and extend earlier work on clustering system log files to assess its generalization to continuous deployment logs. We consider the optional inclusion of one of these dimensionality reduction techniques: Principal Component Analysis (PCA), Latent Semantic Indexing (LSI), and Non-negative Matrix Factorization (NMF). Moreover, we consider three alternative cluster merge criteria (Single Linkage, Average Linkage, and Weighted Linkage), in addition to the Complete Linkage criterion used in earlier work. We empirically evaluate the 16 resulting configurations on continuous deployment logs provided by our industrial collaborator.

Results: Our study shows that (1) identifying problems in continuous deployment logs via clustering is feasible, (2) including NMF significantly improves overall accuracy and robustness, and (3) Complete Linkage performs best of all merge criteria analyzed.

Conclusions: We conclude that problem identification via automated log clustering is improved by including dimensionality reduction, as it decreases the pipeline's sensitivity to parameter choice, thereby increasing its robustness for handling different inputs.
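
To make the four merge criteria and the count of 16 configurations concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that "Weighted Linkage" refers to WPGMA, that points are compared with Euclidean distance, and that merging stops at a distance threshold. The names `merged_distance`, `agglomerate`, and `threshold` are hypothetical, and the log vectorization and dimensionality reduction steps are omitted.

```python
from itertools import product

# Options named in the abstract; "none" stands for running without reduction.
REDUCTIONS = ["none", "PCA", "LSI", "NMF"]
LINKAGES = ["single", "complete", "average", "weighted"]
CONFIGURATIONS = list(product(REDUCTIONS, LINKAGES))  # 4 x 4 = 16 pairs


def merged_distance(linkage, d_im, d_jm, n_i, n_j):
    """Distance from a cluster m to the merge of clusters i and j.

    d_im and d_jm are the distances from m to i and to j; n_i and n_j are
    the sizes of i and j. These are the standard Lance-Williams updates.
    """
    if linkage == "single":
        return min(d_im, d_jm)
    if linkage == "complete":
        return max(d_im, d_jm)
    if linkage == "average":  # UPGMA: weighted by cluster size
        return (n_i * d_im + n_j * d_jm) / (n_i + n_j)
    if linkage == "weighted":  # WPGMA (assumed meaning): plain mean
        return (d_im + d_jm) / 2
    raise ValueError(f"unknown linkage: {linkage}")


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerate(points, linkage, threshold):
    """Merge the closest pair of clusters until none is within threshold."""
    clusters = {i: [i] for i in range(len(points))}
    # Pairwise distances, keyed by (smaller id, larger id).
    dist = {
        (i, j): euclidean(points[i], points[j])
        for i in clusters for j in clusters if i < j
    }
    next_id = len(points)
    while len(clusters) > 1:
        (i, j), d = min(dist.items(), key=lambda kv: kv[1])
        if d > threshold:
            break
        n_i, n_j = len(clusters[i]), len(clusters[j])
        # Replace distances to i and j with distances to the merged cluster.
        for m in clusters:
            if m in (i, j):
                continue
            d_im = dist.pop((min(i, m), max(i, m)))
            d_jm = dist.pop((min(j, m), max(j, m)))
            dist[(m, next_id)] = merged_distance(linkage, d_im, d_jm, n_i, n_j)
        del dist[(i, j)]
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Toy data: four evenly spaced points and one outlier.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
single_result = agglomerate(points, "single", threshold=1.5)
complete_result = agglomerate(points, "complete", threshold=1.5)
```

On this toy input, Single Linkage chains the four evenly spaced points into one cluster, while Complete Linkage stops at two pairs because it measures the farthest members of each candidate merge. This only illustrates how the criteria differ; it says nothing about which performs better on real logs.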

research
04/22/2022

Compressibility: Power of PCA in Clustering Problems Beyond Dimensionality Reduction

In this paper we take a step towards understanding the impact of princip...
research
08/16/2020

Spectrum-Based Log Diagnosis

We present and evaluate Spectrum-Based Log Diagnosis (SBLD), a method to...
research
10/31/2019

Solving NMF with smoothness and sparsity constraints using PALM

Non-negative matrix factorization is a problem of dimensionality reducti...
research
09/01/2021

Selecting Optimal Trace Clustering Pipelines with AutoML

Trace clustering has been extensively used to preprocess event logs. By ...
research
11/15/2020

An efficient label-free analyte detection algorithm for time-resolved spectroscopy

Time-resolved spectral techniques play an important analysis tool in man...
research
07/22/2014

Resolution-limit-free and local Non-negative Matrix Factorization quality functions for graph clustering

Many graph clustering quality functions suffer from a resolution limit, ...
