Identifying Impactful Service System Problems via Log Analysis
- Shilin He,
- Qingwei Lin 林庆维,
- ...
ESEC/FSE 2018
Logs are often used for troubleshooting in large-scale software systems. For a cloud-based online system that provides 24/7 service, a huge number of logs can be generated every day. However, these logs are generally highly imbalanced: most indicate normal system operations, and only a small percentage reveal impactful problems. Problems that lead to a decline in system KPIs (Key Performance Indicators) are impactful and should be fixed by engineers with high priority. Furthermore, there are many types of system problems, which are hard to distinguish manually. In this paper, we propose Log3C, a novel clustering-based approach to rapidly and precisely identify impactful system problems based on the analysis of log sequences (sequences of log events) as well as system KPIs. More specifically, we design a novel cascading clustering algorithm, which greatly reduces clustering time while maintaining high accuracy by iteratively sampling, clustering, and matching log sequences. We then identify the impactful problems by correlating the clusters of log sequences with system KPIs. We evaluate our approach using real-world log data collected from an online service system, and the results confirm its effectiveness and efficiency. Furthermore, we have also successfully applied the proposed approach in industrial practice.
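The sample, cluster, match loop described in the abstract can be illustrated with a minimal sketch. This is not the Log3C implementation. It assumes each log sequence has already been turned into a numeric vector, and it swaps in a simple greedy "leader" clustering with a Euclidean distance threshold for the clustering step. The function names and the parameters `sample_rate`, `threshold`, and `seed` are hypothetical.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def extract_patterns(sample, threshold):
    """Greedy stand-in for the clustering step (an assumption, not the
    paper's method): a vector becomes a new pattern only if it is farther
    than `threshold` from every pattern found so far."""
    patterns = []
    for v in sample:
        if all(distance(v, p) > threshold for p in patterns):
            patterns.append(v)
    return patterns


def cascading_clustering(vectors, sample_rate=0.2, threshold=1.0, seed=0):
    """Return (patterns, labels), where labels[i] is the index in
    `patterns` of the pattern that vectors[i] was matched to."""
    rng = random.Random(seed)
    patterns = []
    labels = [None] * len(vectors)
    remaining = list(range(len(vectors)))
    while remaining:
        # 1. Sampling: draw a small random subset of the unassigned vectors.
        k = max(1, int(len(remaining) * sample_rate))
        sample = [vectors[i] for i in rng.sample(remaining, k)]
        # 2. Clustering: cluster only the sample and keep its patterns.
        patterns.extend(extract_patterns(sample, threshold))
        # 3. Matching: assign each unassigned vector to its nearest pattern
        #    if it is close enough; the rest go to the next iteration.
        unmatched = []
        for i in remaining:
            best = min(range(len(patterns)),
                       key=lambda p: distance(vectors[i], patterns[p]))
            if distance(vectors[i], patterns[best]) <= threshold:
                labels[i] = best
            else:
                unmatched.append(i)
        remaining = unmatched
    return patterns, labels
```

Because every sampled vector lies within `threshold` of some extracted pattern, each iteration assigns at least the sampled vectors, so the loop terminates. The expensive clustering step only ever sees the small sample, while the bulk of the data is handled by the cheaper matching step, which is where the time saving in this sketch comes from.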