How to leverage logs to find similar incidents for a given event.

Artificial Intelligence for IT Operations (AIOps) refers to software systems that combine big data with artificial intelligence (AI) or machine learning (ML) to mine large volumes of information from disparate data sources and identify events (e.g., logs, metrics, alerts, incidents, anomalies). AIOps then correlates and groups these events by inferring patterns for fault localization, and uses this information to find similar historical incidents for action recommendation.

According to Catchpoint’s SRE Report 2020, 80% of SREs work on post-mortem analysis of incidents due to a lack of provided information, and 16% of toil comes from investigating false positives/negatives. Incident management includes finding similar incidents for a given event. [1]

This is a challenging problem for two reasons. First, alerts and incidents can use different vocabulary: alert descriptions are machine generated, whereas incident descriptions are written by humans. Second, two or more events may share the same description even though their underlying root causes differ. This article addresses these challenges by leveraging logs to find similar incidents.


This section defines the terms related to incident management that we will be using throughout this article:

  • An event indicates that something of note has happened and is associated with one or more applications, services or other managed resources. For instance, a container is moved to a new node, a column is added to a DB table, a new version of an application is deployed, or memory or CPU is exhausted.
  • An alert is a record (type) of an event indicating a (fault) condition in the managed environment. It requires (or will require in the future) human or automatic attention and actions toward remediation. For instance, disk drive failure or a network link down could be alerts.
  • An incident represents a reduction in the quality of a business application or service. It is driven by one or more alerts. Incidents require prompt attention. For instance, an unresponsive application or inaccessible storage array could be serious outages.
  • Logs are a fundamental source of data generated from every level of components in an application. In each log line, details about the event — such as a resource that was accessed, who accessed it and the time — are included.

Finding similar events

Two events may or may not have a similar description but if the underlying logs are similar, then they are most likely related to each other — this is the key hypothesis of using logs for finding similar events.

Each application consists of several microservices, and some of these services depend on other services, forming a graph. If one service fails, any service upstream or downstream of it could also emit error log lines. It is therefore important to identify the error log lines corresponding to each failed microservice and collate them to form a log signature for a particular event. We obtain the log lines for each event from a time window of ±5 minutes around the outage start time (i.e., 10 minutes of log data). Each log line in this set is input to a pretrained error classifier, which outputs 0 (error) or 1 (non-erroneous). The error classifier allows us to separate the log lines that reflect a healthy state of the system and the corresponding microservice from the erroneous log lines.
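The windowing and filtering step can be sketched as follows. This is a minimal illustration, not the article's implementation: the keyword-based `classify` function is a hypothetical stand-in for the pretrained error classifier, and the record fields (`timestamp`, `application_id`, `message`) and sample log lines are assumptions made for the example. Only the ±5 minute window and the 0/1 label convention come from the text above.

```python
from datetime import datetime, timedelta

# Hypothetical stand-in for the pretrained error classifier (assumption).
# Label convention follows the article: 0 = error, 1 = non-erroneous.
ERROR_KEYWORDS = ("error", "exception", "failed", "timeout")


def classify(message: str) -> int:
    lowered = message.lower()
    return 0 if any(keyword in lowered for keyword in ERROR_KEYWORDS) else 1


def error_lines_for_event(logs, outage_start, half_window=timedelta(minutes=5)):
    """Keep records within +/- half_window of outage_start that are labeled 0 (error)."""
    start, end = outage_start - half_window, outage_start + half_window
    return [
        record for record in logs
        if start <= record["timestamp"] <= end and classify(record["message"]) == 0
    ]


# Made-up sample data for illustration only.
outage = datetime(2024, 1, 1, 12, 0)
logs = [
    {"timestamp": datetime(2024, 1, 1, 11, 50), "application_id": "application_id_a",
     "message": "Connection timeout to db"},  # error, but outside the window
    {"timestamp": datetime(2024, 1, 1, 11, 57), "application_id": "application_id_a",
     "message": "ERROR: query failed"},
    {"timestamp": datetime(2024, 1, 1, 12, 2), "application_id": "application_id_b",
     "message": "Request served in 12 ms"},  # inside the window, but healthy
    {"timestamp": datetime(2024, 1, 1, 12, 4), "application_id": "application_id_b",
     "message": "Exception while rendering dashboard"},
]
selected = error_lines_for_event(logs, outage)
```

In practice the keyword check would be replaced by the trained classifier; the rest of the function only depends on it returning 0 for error lines.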

To use error log lines for event similarity, each log line is processed and templatized, and the results are collated to form a log signature for each event. The objective of templatization is to normalize log lines to a common ID, called a template ID. As a result, each event has a set of template IDs and corresponding application IDs. We propose a log-signature representation for each event built from its template IDs and corresponding application IDs, and use it for event similarity.
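The article does not say how templatization is performed, so the following is only a sketch under stated assumptions: variable parts of a log line (IP addresses, hex values, numbers) are masked with regular expressions, and the template ID is a short hash of the masked line. The masking rules, the helper names `to_template`, `template_id` and `log_signature`, and the sample log lines are all hypothetical.

```python
import hashlib
import re

# Assumed masking rules. Order matters: IPs and hex values are masked
# before plain numbers so their digits are not split into <NUM> tokens.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def to_template(message: str) -> str:
    """Replace variable fields with placeholder tokens."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message


def template_id(message: str) -> str:
    """Derive a short, stable ID from the masked log line."""
    return hashlib.sha1(to_template(message).encode("utf-8")).hexdigest()[:8]


def log_signature(error_lines):
    """Collate unique (application ID, template ID) pairs into a signature."""
    seen = set()
    templates = []
    for record in error_lines:
        key = (record["application_id"], template_id(record["message"]))
        if key not in seen:
            seen.add(key)
            templates.append({"application_id": key[0], "template": key[1]})
    return {"templates": templates}


# Made-up error lines for illustration only.
error_lines = [
    {"application_id": "application_id_a", "message": "Connection to 10.0.0.12 timed out after 30 s"},
    {"application_id": "application_id_a", "message": "Connection to 10.0.0.7 timed out after 45 s"},
    {"application_id": "application_id_b", "message": "Worker 17 crashed at 0x7ffe"},
]
signature = log_signature(error_lines)
```

The first two lines differ only in their variable fields, so they collapse to one template and the signature contains two entries rather than three.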

The example below shows a log signature for an event. There are three log template IDs: template_id_a, template_id_b and template_id_c. Two of them (template_id_a and template_id_b) belong to application_id_a, and one (template_id_c) belongs to application_id_b. This representation is called the log signature of an event:

  "templates": [{
     "application_id": "application_id_a",
     "template": "template_id_a"
  }, {
     "application_id": "application_id_a",
     "template": "template_id_b"
  }, {
     "application_id": "application_id_b",
     "template": "template_id_c"
  }]

Once we have a log signature for each event, the similarity between two events is calculated by first computing the overlap between their application IDs. For each application ID that the two events share, we then compute the overlap between the corresponding template IDs. The resulting score is called the log template similarity score.
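The text does not give the exact scoring formula, so the sketch below makes an assumption: template overlap within each shared application is measured with Jaccard similarity, and the per-application scores are summed and divided by the number of applications appearing in either signature, so applications that are not shared lower the score. The function names and the second sample signature are hypothetical.

```python
from collections import defaultdict


def templates_by_app(signature):
    """Group a signature's template IDs by application ID."""
    grouped = defaultdict(set)
    for entry in signature["templates"]:
        grouped[entry["application_id"]].add(entry["template"])
    return grouped


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def log_template_similarity(sig_a, sig_b) -> float:
    """Assumed scoring: per-application Jaccard overlap, normalized by all applications seen."""
    apps_a, apps_b = templates_by_app(sig_a), templates_by_app(sig_b)
    all_apps = set(apps_a) | set(apps_b)
    if not all_apps:
        return 0.0
    shared_apps = set(apps_a) & set(apps_b)
    return sum(jaccard(apps_a[app], apps_b[app]) for app in shared_apps) / len(all_apps)


event_a = {"templates": [
    {"application_id": "application_id_a", "template": "template_id_a"},
    {"application_id": "application_id_a", "template": "template_id_b"},
    {"application_id": "application_id_b", "template": "template_id_c"},
]}
# Made-up second event for illustration only.
event_b = {"templates": [
    {"application_id": "application_id_a", "template": "template_id_a"},
    {"application_id": "application_id_a", "template": "template_id_b"},
    {"application_id": "application_id_a", "template": "template_id_d"},
    {"application_id": "application_id_c", "template": "template_id_e"},
]}
score = log_template_similarity(event_a, event_b)
```

Here the two events share only application_id_a, where two of three distinct templates match, and three applications appear overall, so the score is (2/3) / 3. Other normalizations, such as averaging over shared applications only, are equally plausible readings of the description.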

Hypothesis testing

In this section, we want to verify the hypothesis that two similar events may or may not have lexically matching incident descriptions, but that their logs should have high overlap and be discriminative. Figure 1 shows four events that SREs told us were similar to each other:

Figure 1: A set of four events that SREs described as similar to each other. Values in blue and green are text based and log-template based similarity scores, respectively.

We computed the similarity between them using two methods: text-based similarity and log-template-based similarity. To compute the event-description-based similarity between two events, we obtain a distributed representation of each event description in the pair using the Universal Sentence Encoder and then compute the cosine similarity between the two representations.
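The cosine similarity step can be sketched in plain Python as shown below. The vectors here are made-up toy values, not actual Universal Sentence Encoder outputs; in a real pipeline each vector would be the encoder's embedding of one event description.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy embeddings (assumption): stand-ins for two encoded event descriptions.
embedding_a = [0.2, 0.7, 0.1]
embedding_b = [0.4, 1.4, 0.2]   # same direction as embedding_a
embedding_c = [0.7, -0.2, 0.0]  # orthogonal to embedding_a

same_direction = cosine_similarity(embedding_a, embedding_b)
orthogonal = cosine_similarity(embedding_a, embedding_c)
```

Vectors pointing in the same direction score close to 1 regardless of magnitude, and orthogonal vectors score close to 0.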

In the previous section, we outlined our method for calculating log-template-based similarity between two events. The results show that whenever text descriptions have many overlapping terms, the text-based similarity method gives them high scores. However, when there are few overlapping terms, the text-based similarity score is lower. For example, the incident descriptions “database processing delayed for some users” and “Customers unable to view DB dashboard” have a low similarity score of 0.055. According to the ground truth communicated by the SRE, these two events are actually related to each other.

When we use log-template-based similarity to compute similarity between events, we observe that it captures the relatedness between events very well. This is because the similarity is computed based on the symptoms reflected in the logs captured through log signatures. For example, for the pair mentioned above, the log-template-based similarity score is 0.783, which indicates that their log signatures do have a high overlap, thus indicating high relatedness between them.


Using the text descriptions of events to compute similarity between them is not reliable and may result in inaccuracies. This article presents an approach that leverages logs to compute similarity between events and shows, on the example above, that it captures relatedness that the traditional text-based similarity method misses.

Additional resources


[1] Chen, Y., Yang, X., Dong, H., He, X., Zhang, H., Lin, Q., Chen, J., Zhao, P., Kang, Y., Gao, F., et al.: Identifying linked incidents in large-scale online service systems. In: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 304–314 (2020)
