How to leverage logs to find similar incidents for a given event.

Artificial Intelligence for IT Operations (AIOps) are software systems that combine big data and artificial intelligence (AI) or machine learning (ML) to mine a voluminous amount of information coming from disparate data sources for identifying events (e.g., logs, metrics, alerts, incidents, anomalies). AIOps then correlates and groups them by inferring patterns for fault localization and uses this information to find similar historical incidents for action recommendation.

In Catchpoint’s SRE Report 2020, 80% of SREs work on post-mortem analysis of incidents due to lack of provided information and 16% of toil comes from investigating false positives/negatives. Incident management includes finding similar incidents for a given event. [1]

This is a challenging problem because the vocabulary of alerts and incidents can be different; also, alert descriptions are machine generated, whereas incident descriptions are human generated. Moreover, it may be the case that two or more events may have the same description, however, the underlying root causes are different. This article addresses the above challenge by leveraging logs for finding similar incidents.


This section defines the terms related to incident management that we will be using throughout this article:

  • An event indicates that something of note has happened and is associated with one or more applications, services or other managed resources. For instance, a container is moved to a new node, column is added to a DB table, a new version of an application is deployed or memory or CPU is exhausted.
  • An alert is a record (type) of an event indicating a (fault) condition in the managed environment. It requires (or will require in the future) human or automatic attention and actions toward remediation. For instance, disk drive failure or a network link down could be alerts.
  • An incident represents a reduction in the quality of a business application or service. It is driven by one or more alerts. Incidents require prompt attention. For instance, an unresponsive application or inaccessible storage array could be serious outages.
  • Logs are a fundamental source of data generated from every level of components in an application. In each log line, details about the event — such as a resource that was accessed, who accessed it and the time — are included.

Finding similar events

Two events may or may not have a similar description but if the underlying logs are similar, then they are most likely related to each other — this is the key hypothesis of using logs for finding similar events.

Each application consists of several microservices, and some of these services are related to other services, forming a graph. If one service fails, then any other service which is upstream or downstream of the failed service could throw error log lines. It is important to identify error log lines corresponding to each failed microservice and collate them together to form a log signature for a particular event. We obtain log lines corresponding to each event from the time window of +- 5 minutes from outage start time (i.e., 10 minutes of log data). Each log line from the set of log lines is input to a pretrained error classifier; the output of the classifier is a 0 (error) or 1 (non-erroneous). The error classifier allows us to separate log lines pertaining to a healthy state of the system and the corresponding microservice from the non-erroneous log lines.

In order to use error log lines for event similarity, each log line is processed and templatized, and then they are collated to form a log-signature for each event. The objective of templatization is to normalize log lines to a common id, called as template-id. As a result, for a given event, there is a set of templates-ids and corresponding application-ids. We propose a log-signature representation for each event from its template-ids and corresponding application-ids, and use that for event similarity.

The example below shows a log signature for an event. There are three log template ids: template_id_atemplate_id_b and template_id_c. Two log template ids (template_id_a and template_id_b) belong to application_id_a, and one log template id (template_id_c) belongs to application_id_b. This representation is called as log signature of an event:

  "templates": [{
     "application_id": "application_id_a",
     "template": "template_id_a"
  }, {
     "application_id": "application_id_a",
     "template": "template_id_b"
  }, {
     "application_id": "application_id_b",
     "template": "template_id_c"
Scroll to view full table

Once we have a log signature for each event, the similarity is calculated between two events by computing the overlap between their application ids. For each application id that overlaps, it computes the overlap between their respective templates ids to calculate a score called as log template similarity score.

Hypothesis testing

In this section, we want to verify the hypothesis that the two similar events may or may not have a lexically matching incident descriptions, but that their logs should have high overlap and that they are discriminative. Figure 1 shows four events where SREs communicated to us that they were similar to each other:

Figure 1: A set of four events that SREs described as similar to each other. Values in blue and green are text based and log-template based similarity scores, respectively.

We computed the similarity between them using the two methods, text-based similarity and log-template-based similarity. To compute the event-description-based similarity between two events, we obtain the distributed representation using universal sentence encoder for each event in the pair and then compute cosine similarity between them.

In the previous section, we outlined our method for calculating log-template-based similarity between two events. These results show that whenever text descriptions have high overlapping terms, the text-based similarity method have high scores for them. However, when there are few overlapping terms, the text-based similarity has a lower score. For example, the similarity between incident descriptions “database processing delayed for some users” and “Customers unable to view DB dashboard” have a low similarity score of 0.055. As per the ground truth communicated by the SRE, these two events are actually related to each other.

When we use log-template-based similarity to compute similarity between events, we observe that it captures the relatedness between events very well. This is because the similarity is computed based on the symptoms reflected in the logs captured through log signatures. For example, for the pair mentioned above, the log-template-based similarity score is 0.783, which indicates that their log signatures do have a high overlap, thus indicating high relatedness between them.


Using text description of events to compute similarity between them is not reliable and may result in inaccuracies. This article presents an approach that leverage logs for computing similarity between events and shows superior performance of the proposed method over the traditional text-based similarity method.

Additional resources


[1] Chen, Y., Yang, X., Dong, H., He, X., Zhang, H., Lin, Q., Chen, J., Zhao, P., Kang,Y., Gao, F., et al.: Identifying linked incidents in large-scale online service systems. In: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 304– 314 (2020)

More from Cloud

Modernizing child support enforcement with IBM and AWS

7 min read - With 68% of child support enforcement (CSE) systems aging, most state agencies are currently modernizing them or preparing to modernize. More than 20% of families and children are supported by these systems, and with the current constituents of these systems becoming more consumer technology-centric, the use of antiquated technology systems is archaic and unsustainable. At this point, families expect state agencies to have a modern, efficient child support system. The following are some factors driving these states to pursue modernization:…

7 min read

IBM Cloud Databases for Elasticsearch End of Life and pricing changes

2 min read - As part of our partnership with Elastic, IBM is announcing the release of a new version of IBM Cloud Databases for Elasticsearch. We are excited to bring you an enhanced offering of our enterprise-ready, fully managed Elasticsearch. Our partnership with Elastic means that we will be able to offer more, richer functionality and world-class levels of support. The release of version 7.17 of our managed database service will include support for additional functionality, including things like Role Based Access Control…

2 min read

Connected products at the edge

6 min read - There are many overlapping business usage scenarios involving both the disciplines of the Internet of Things (IoT) and edge computing. But there is one very practical and promising use case that has been commonly deployed without many people thinking about it: connected products. This use case involves devices and equipment embedded with sensors, software and connectivity that exchange data with other products, operators or environments in real-time. In this blog post, we will look at the frequently overlooked phenomenon of…

6 min read

SRG Technology drives global software services with IBM Cloud VPC under the hood

4 min read - Headquartered in Ft. Lauderdale, Florida, SRG Technology LLC. (SRGT) is a software development company supporting the education, healthcare and travel industries. Their team creates data systems that deliver the right data in real time to customers around the globe. Whether those customers are medical offices and hospitals, schools or school districts, government agencies, or individual small businesses, SRGT addresses a wide spectrum of software services and technology needs with round-the-clock innovative thinking and fresh approaches to modern data problems. The…

4 min read