Leveraging Log Data for Incident Management in AIOps
25 May 2021
5 min read
How to leverage logs to find similar incidents for a given event.

Artificial Intelligence for IT Operations (AIOps) refers to software systems that combine big data and artificial intelligence (AI) or machine learning (ML) to mine the voluminous information coming from disparate data sources (e.g., logs, metrics, alerts, incidents, anomalies) in order to identify events. AIOps then correlates and groups these events by inferring patterns for fault localization, and uses this information to find similar historical incidents for action recommendation.

According to Catchpoint's SRE Report 2020, 80% of SREs work on post-mortem analysis of incidents due to a lack of provided information, and 16% of toil comes from investigating false positives/negatives. A key part of incident management is finding similar incidents for a given event. [1]

This is a challenging problem because the vocabularies of alerts and incidents can differ: alert descriptions are machine generated, whereas incident descriptions are human generated. Moreover, two or more events may have the same description even though their underlying root causes are different. This article addresses this challenge by leveraging logs to find similar incidents.

 
Terminology

This section defines the terms related to incident management that we will be using throughout this article:

  • An event indicates that something of note has happened and is associated with one or more applications, services, or other managed resources. For instance, a container is moved to a new node, a column is added to a DB table, a new version of an application is deployed, or memory or CPU is exhausted.
  • An alert is a record (type) of an event indicating a (fault) condition in the managed environment. It requires (or will require in the future) human or automatic attention and actions toward remediation. For instance, disk drive failure or a network link down could be alerts.
  • An incident represents a reduction in the quality of a business application or service. It is driven by one or more alerts. Incidents require prompt attention. For instance, an unresponsive application or inaccessible storage array could be serious outages.
  • Logs are a fundamental source of data generated at every level of components in an application. Each log line includes details about the event, such as the resource that was accessed, who accessed it, and the time.

Finding similar events

Two events may or may not have similar descriptions, but if their underlying logs are similar, then the events are most likely related to each other. This is the key hypothesis behind using logs for finding similar events.

Each application consists of several microservices, and some of these services depend on other services, forming a graph. If one service fails, any other service upstream or downstream of the failed service could throw error log lines. It is important to identify the error log lines corresponding to each failed microservice and collate them to form a log signature for the event. We obtain the log lines corresponding to each event from a window of ±5 minutes around the outage start time (i.e., 10 minutes of log data). Each log line from this set is input to a pretrained error classifier, whose output is 0 (erroneous) or 1 (non-erroneous). The error classifier allows us to separate the erroneous log lines from those that reflect a healthy state of the system and the corresponding microservice.
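
As a rough sketch of this step, the snippet below collects log lines in a ±5 minute window around the outage start time and keeps only the lines flagged as erroneous. The fetch_log_lines helper and is_error classifier are hypothetical placeholders standing in for the log store and the pretrained error classifier, which the article does not detail.

from datetime import datetime, timedelta
from typing import Callable, List

def collect_error_logs(
    outage_start: datetime,
    fetch_log_lines: Callable[[datetime, datetime], List[str]],  # hypothetical log-store accessor
    is_error: Callable[[str], bool],                             # hypothetical pretrained error classifier
    window_minutes: int = 5,
) -> List[str]:
    # Take log lines from +/- window_minutes around the outage start (10 minutes in total).
    start = outage_start - timedelta(minutes=window_minutes)
    end = outage_start + timedelta(minutes=window_minutes)
    lines = fetch_log_lines(start, end)
    # Keep only the erroneous lines; healthy-state lines are discarded.
    return [line for line in lines if is_error(line)]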

In order to use error log lines for event similarity, each log line is processed and templatized, and the resulting templates are collated to form a log signature for each event. The objective of templatization is to normalize log lines to a common id, called a template-id. As a result, for a given event, there is a set of template-ids and corresponding application-ids. We propose a log-signature representation for each event built from its template-ids and corresponding application-ids, and use that representation for event similarity.
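
As a minimal illustration of templatization, the sketch below masks the variable parts of a log line (numbers, hex ids) and hashes the result into a template-id. The article does not specify which templatization algorithm is used, so this regex-based masking is only a simplified stand-in for a real log parser.

import hashlib
import re

def templatize(log_line: str) -> str:
    # Mask the variable parts of the line so that lines differing only in
    # these values map to the same template.
    template = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", log_line)
    template = re.sub(r"\d+(\.\d+)?", "<NUM>", template)
    template = re.sub(r"\s+", " ", template).strip()
    # Hash the normalized text into a stable template-id.
    return "template_" + hashlib.md5(template.encode("utf-8")).hexdigest()[:8]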

The example below shows a log signature for an event. There are three log template ids: template_id_a, template_id_b, and template_id_c. Two of them (template_id_a and template_id_b) belong to application_id_a, and one (template_id_c) belongs to application_id_b. This representation is called the log signature of an event:

{
  "templates": [{
    "application_id": "application_id_a",
    "template": "template_id_a"
  }, {
    "application_id": "application_id_a",
    "template": "template_id_b"
  }, {
    "application_id": "application_id_b",
    "template": "template_id_c"
  }]
}
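
Assuming the templatize helper from the earlier sketch, a log signature like the one above could be assembled from the (application-id, error log line) pairs of an event as follows; the function name and input format are illustrative, not the article's actual implementation.

from typing import Dict, List, Tuple

def build_log_signature(error_lines: List[Tuple[str, str]]) -> Dict:
    # error_lines holds (application_id, log_line) pairs for one event.
    seen = set()
    templates = []
    for application_id, line in error_lines:
        template_id = templatize(line)  # templatize() from the earlier sketch
        if (application_id, template_id) not in seen:
            seen.add((application_id, template_id))
            templates.append({"application_id": application_id,
                              "template": template_id})
    return {"templates": templates}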

Once we have a log signature for each event, the similarity between two events is calculated by computing the overlap between their application ids. For each overlapping application id, we compute the overlap between the respective template ids to calculate a score called the log template similarity score.
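
The article does not spell out the exact scoring formula, so the sketch below uses one plausible reading: the Jaccard overlap of template ids within each overlapping application id, averaged and weighted by the fraction of application ids the two signatures share.

from collections import defaultdict
from typing import Dict

def log_template_similarity(sig_a: Dict, sig_b: Dict) -> float:
    # Group template ids by application id for each signature.
    def by_app(sig):
        groups = defaultdict(set)
        for entry in sig["templates"]:
            groups[entry["application_id"]].add(entry["template"])
        return groups

    apps_a, apps_b = by_app(sig_a), by_app(sig_b)
    shared_apps = set(apps_a) & set(apps_b)
    if not shared_apps:
        return 0.0
    # Jaccard overlap of template ids within each overlapping application.
    per_app = [
        len(apps_a[app] & apps_b[app]) / len(apps_a[app] | apps_b[app])
        for app in shared_apps
    ]
    # Weight by how many application ids the two signatures have in common.
    app_overlap = len(shared_apps) / len(set(apps_a) | set(apps_b))
    return app_overlap * sum(per_app) / len(per_app)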

Hypothesis testing

In this section, we want to verify the hypothesis that two similar events may or may not have lexically matching incident descriptions, but that their logs should have a high overlap, and that this overlap is discriminative. Figure 1 shows four events that SREs communicated to us were similar to each other.

We computed the similarity between them using two methods: text-based similarity and log-template-based similarity. To compute the event-description-based similarity between two events, we obtain a distributed representation of each event description using the Universal Sentence Encoder and then compute the cosine similarity between them.
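
As a minimal sketch of this text-based baseline, assuming the pretrained Universal Sentence Encoder published on TensorFlow Hub (the article does not say which encoder version or framework was used):

import numpy as np
import tensorflow_hub as hub

# Load a pretrained Universal Sentence Encoder from TensorFlow Hub.
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def description_similarity(text_a: str, text_b: str) -> float:
    # Embed both event descriptions and return their cosine similarity.
    emb = use_model([text_a, text_b]).numpy()
    a, b = emb[0], emb[1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The incident pair discussed in the next paragraph; the article reports a
# score of 0.055 for it (the exact value depends on the encoder version).
score = description_similarity(
    "database processing delayed for some users",
    "Customers unable to view DB dashboard",
)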

In the previous section, we outlined our method for calculating log-template-based similarity between two events. The results show that whenever text descriptions have many overlapping terms, the text-based similarity method produces high scores for them; when there are few overlapping terms, it produces lower scores. For example, the incident descriptions “database processing delayed for some users” and “Customers unable to view DB dashboard” have a low similarity score of 0.055, even though, per the ground truth communicated by the SREs, these two events are actually related to each other.

When we use log-template-based similarity to compute the similarity between events, we observe that it captures the relatedness between events very well, because the similarity is computed from the symptoms reflected in the logs and captured through the log signatures. For example, for the pair mentioned above, the log-template-based similarity score is 0.783, which indicates that their log signatures have a high overlap and thus that the events are highly related.

Summary

Using the text descriptions of events to compute similarity between them is not reliable and may result in inaccuracies. This article presents an approach that leverages logs for computing similarity between events and shows the superior performance of the proposed method over the traditional text-based similarity method.

Reference

[1] Chen, Y., Yang, X., Dong, H., He, X., Zhang, H., Lin, Q., Chen, J., Zhao, P., Kang, Y., Gao, F., et al.: Identifying linked incidents in large-scale online service systems. In: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 304–314 (2020)

Author
Harshit Kumar IBM Research
Ajay Gupta Senior Research Engineer
Haibin Liu Senior Software Engineer
Anbang Xu Data Scientist, Master Inventor
Gargi Dasgupta Director and CTO, IBM Research India