How to leverage logs to find similar incidents for a given event.

Artificial Intelligence for IT Operations (AIOps) refers to software systems that combine big data and artificial intelligence (AI) or machine learning (ML) to mine voluminous information from disparate data sources in order to identify events (e.g., logs, metrics, alerts, incidents, anomalies). AIOps then correlates and groups these events by inferring patterns for fault localization, and uses this information to find similar historical incidents for action recommendation.

According to Catchpoint’s SRE Report 2020, 80% of SREs work on post-mortem analysis of incidents due to a lack of provided information, and 16% of toil comes from investigating false positives and negatives. A key part of incident management is finding similar incidents for a given event. [1]

This is a challenging problem because alerts and incidents can use different vocabularies: alert descriptions are machine generated, whereas incident descriptions are human written. Moreover, two or more events may share the same description even though their underlying root causes differ. This article addresses these challenges by leveraging logs to find similar incidents.


This section defines the terms related to incident management that we will be using throughout this article:

  • An event indicates that something of note has happened and is associated with one or more applications, services or other managed resources. For instance, a container is moved to a new node, a column is added to a DB table, a new version of an application is deployed, or memory or CPU is exhausted.
  • An alert is a record of an event indicating a (fault) condition in the managed environment. It requires (or will require in the future) human or automatic attention and action toward remediation. For instance, a disk drive failure or a network link going down could be alerts.
  • An incident represents a reduction in the quality of a business application or service and is driven by one or more alerts. Incidents require prompt attention. For instance, an unresponsive application or an inaccessible storage array could be serious outages.
  • Logs are a fundamental source of data generated at every level of an application’s components. Each log line includes details about an event, such as the resource that was accessed, who accessed it and the time.

Finding similar events

Two events may or may not have a similar description but if the underlying logs are similar, then they are most likely related to each other — this is the key hypothesis of using logs for finding similar events.

Each application consists of several microservices, some of which depend on other services, forming a graph. If one service fails, any service upstream or downstream of the failed service could throw error log lines. It is important to identify the error log lines corresponding to each failed microservice and collate them to form a log signature for a particular event. We obtain log lines for each event from a window of ±5 minutes around the outage start time (i.e., 10 minutes of log data). Each log line from this set is input to a pretrained error classifier; the output of the classifier is 0 (erroneous) or 1 (non-erroneous). The error classifier allows us to separate erroneous log lines from those pertaining to a healthy state of the system and the corresponding microservice.
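The windowing and filtering step can be sketched as follows. This is a minimal illustration, not the article's implementation: the `classify` function stands in for the pretrained error classifier (here a toy keyword check), and the data layout is assumed.

```python
from datetime import datetime, timedelta

def collect_error_lines(log_lines, outage_start, classify):
    """Collect erroneous log lines within +/- 5 minutes of the outage start.

    `log_lines` is a list of (timestamp, text) tuples; `classify` is a
    pretrained error classifier returning 0 for erroneous lines and
    1 for non-erroneous ones (names are illustrative).
    """
    window = timedelta(minutes=5)
    in_window = [text for ts, text in log_lines
                 if outage_start - window <= ts <= outage_start + window]
    # Keep only lines the classifier flags as erroneous (label 0)
    return [text for text in in_window if classify(text) == 0]

# Toy stand-in classifier: flag lines containing "ERROR"
classify = lambda line: 0 if "ERROR" in line else 1
start = datetime(2024, 1, 1, 12, 0)
logs = [
    (datetime(2024, 1, 1, 11, 57), "ERROR connection refused"),
    (datetime(2024, 1, 1, 11, 58), "INFO request served"),
    (datetime(2024, 1, 1, 12, 20), "ERROR timeout"),  # outside the window
]
print(collect_error_lines(logs, start, classify))  # ['ERROR connection refused']
```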

In order to use error log lines for event similarity, each log line is processed and templatized, and the results are collated to form a log signature for each event. The objective of templatization is to normalize log lines to a common identifier, called a template-id. As a result, for a given event, there is a set of template-ids and corresponding application-ids. We propose a log-signature representation for each event, built from its template-ids and corresponding application-ids, and use that for event similarity.
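The idea behind templatization can be sketched as below. Production systems typically use a dedicated log parser (such as Drain) to mine templates; this simplified sketch just masks common variable tokens so that log lines differing only in parameter values map to the same template.

```python
import re

def templatize(line):
    """Normalize the variable parts of a log line to produce a template.

    A minimal sketch: mask IP-like tokens, hex ids and numbers so that
    lines differing only in those values collapse to one template.
    """
    line = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<IP>", line)
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)
    line = re.sub(r"\b\d+\b", "<NUM>", line)
    return line

# Two lines differing only in variable values map to the same template
a = templatize("request 4812 to 10.0.0.7 failed")
b = templatize("request 993 to 10.0.0.12 failed")
print(a)       # request <NUM> to <IP> failed
print(a == b)  # True
```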

The example below shows a log signature for an event. There are three log template ids: template_id_a, template_id_b and template_id_c. Two log template ids (template_id_a and template_id_b) belong to application_id_a, and one log template id (template_id_c) belongs to application_id_b. This representation is called the log signature of an event:

  "templates": [{
     "application_id": "application_id_a",
     "template": "template_id_a"
  }, {
     "application_id": "application_id_a",
     "template": "template_id_b"
  }, {
     "application_id": "application_id_b",
     "template": "template_id_c"
  }]

Once we have a log signature for each event, the similarity between two events is calculated by computing the overlap between their application ids. For each overlapping application id, we compute the overlap between the respective template ids to produce a score called the log template similarity score.
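One way this scoring could look in code is sketched below. The exact formula is not specified in the article, so this is an assumption: per-application template overlap is measured with the Jaccard index and averaged over the union of application ids, so applications present in only one signature pull the score down.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def log_template_similarity(sig1, sig2):
    """Hypothetical log template similarity score for two log signatures.

    For each application id present in both signatures, take the Jaccard
    overlap of its template ids, then average over the union of
    application ids (an illustrative choice, not the article's formula).
    """
    apps1 = {t["application_id"] for t in sig1}
    apps2 = {t["application_id"] for t in sig2}
    all_apps = apps1 | apps2
    if not all_apps:
        return 0.0
    score = 0.0
    for app in apps1 & apps2:
        t1 = [t["template"] for t in sig1 if t["application_id"] == app]
        t2 = [t["template"] for t in sig2 if t["application_id"] == app]
        score += jaccard(t1, t2)
    return score / len(all_apps)

sig_a = [{"application_id": "app_a", "template": "t1"},
         {"application_id": "app_a", "template": "t2"},
         {"application_id": "app_b", "template": "t3"}]
sig_b = [{"application_id": "app_a", "template": "t1"},
         {"application_id": "app_b", "template": "t3"}]
print(log_template_similarity(sig_a, sig_b))  # 0.75
```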

Hypothesis testing

In this section, we verify the hypothesis that two similar events may or may not have lexically matching incident descriptions, but that their logs should have high overlap and be discriminative. Figure 1 shows four events that SREs communicated to us were similar to each other:

Figure 1: A set of four events that SREs described as similar to each other. Values in blue and green are text based and log-template based similarity scores, respectively.

We computed the similarity between them using two methods: text-based similarity and log-template-based similarity. To compute the event-description-based similarity between two events, we obtain a distributed representation of each event's description using the Universal Sentence Encoder and then compute the cosine similarity between the two vectors.
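The cosine similarity step is straightforward once the embeddings exist. In the sketch below the toy vectors stand in for Universal Sentence Encoder outputs (in practice the encoder would be loaded, e.g., from TensorFlow Hub); only the similarity computation itself is shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 3-d vectors stand in for sentence-encoder embeddings of the
# two incident descriptions; real embeddings have hundreds of dims.
emb1 = [0.1, 0.9, 0.2]  # e.g., encode("database processing delayed for some users")
emb2 = [0.8, 0.1, 0.3]  # e.g., encode("Customers unable to view DB dashboard")
print(round(cosine_similarity(emb1, emb2), 3))
```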

In the previous section, we outlined our method for calculating log-template-based similarity between two events. The results show that whenever text descriptions share many terms, the text-based similarity method assigns them high scores. However, when there are few overlapping terms, the text-based similarity score is low. For example, the incident descriptions “database processing delayed for some users” and “Customers unable to view DB dashboard” have a low similarity score of 0.055, yet, per the ground truth communicated by the SREs, these two events are actually related to each other.

When we use log-template-based similarity to compute similarity between events, we observe that it captures the relatedness between events very well. This is because the similarity is computed based on the symptoms reflected in the logs captured through log signatures. For example, for the pair mentioned above, the log-template-based similarity score is 0.783, which indicates that their log signatures do have a high overlap, thus indicating high relatedness between them.


Using the text descriptions of events to compute similarity between them is not reliable and may result in inaccuracies. This article presented an approach that leverages logs to compute similarity between events and showed the superior performance of the proposed method over the traditional text-based similarity method.

Additional resources


[1] Chen, Y., Yang, X., Dong, H., He, X., Zhang, H., Lin, Q., Chen, J., Zhao, P., Kang,Y., Gao, F., et al.: Identifying linked incidents in large-scale online service systems. In: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 304– 314 (2020)
