In this article, we define the terms that we use in Watson AIOps analytical pipelines, such as anomalies, alerts, incidents, faults, etc.

IBM Watson AIOps helps IT operations management personnel and site reliability engineers (SREs) detect anomalies early, predict them before they occur, reduce event/alert noise by grouping events/alerts related to same incidents, locate the specific application or infrastructure component faults and failures, determine the scope of incident impact, and recommend relevant and timely actions based on mining prior incident records and/or tickets.

All these analytics help reduce the mean time to detect an incident (MTTD) and mean time to identify/isolate the cause of an incident (MTTI), and, thereby, the mean time to resolve (MTTR) an incident. This, in turn, saves millions of dollars by preventing direct costs (lost revenue, penalties, opportunity costs, etc.) and indirect costs (customer dissatisfaction, lost customers, lost references, etc.). How does it do it? Read this article to find out.

Defining the terms

In this article, we define the terms that we use in Watson AIOps analytical pipelines, such as anomalies, alerts, incidents, faults, etc.

  • Log: A log is a data output from an application, service, server, or network driver that records a transaction or state occurring in it. Logs may or may not adhere to a format. A typical log has a minimum of two elements: time and a statement. The time is when the log was output from the application or service. The statement is a string that describes the transaction or state. Logs may have other elements, such as an error code, service/application name, protocol, etc. A sample application log looks something like this: {” utc_timestamp”: “2020-02-12 23:16:49.377000”, “message”: “ Resyncing ipsets with dataplane. Error invoking mmesh.EntityPredictionService/predictTokenizedDocument method on model”}. As you can see, it’s a combination of structured and unstructured data.
  • Metric: A metric is the measurement of a value of an aspect of an application or infrastructure component that is being monitored. In the hardware world, metrics would be based on things like CPUs, memory (RAM), Disk I/O, Network I/O, etc. Specific examples include CPU and Memory % used (say, 95%). A typical metric has a minimum of two elements: time and value. The time is when the metric capture occurred, and the value is the measure of that metric at that time. There may be more elements. It should be noted that count of log lines of a kind (e.g., error messages) in a given time interval also constitutes a metric.
  • Anomaly: An anomaly is a deviation from the expected baseline behavior. It is a state that occurs when the value of a metric (including those derived from logs, as noted before) is outside of the expected range. This is a binary state; the value is either anomalous or it is not. Anomalies can be those that require an IT operations person’s attention (aka persistent anomalies) or those that are self-resolving (aka transient) and therefore may not need to be raised up as an alert (please see definition of an alert below). An anomaly could be a 500 error (internal server error) or simply a deviation from the normal distribution or pattern of metric values within a given time.
  • Event: An event is a change in state of service. It indicates that something noteworthy has happened. An event by itself may not mean that something bad has happened. An example of an event that is noteworthy that requires no action is that ‘a container has moved to a new host/pod.’ An example of an event that is noteworthy that might require an action is ‘a disk drive failure.’ This, in fact, is an alert. More details on alerts are below. Not all events are alerts. All alerts could be events.
  • Fault: A fault is a defect in the internal state of a component in a system. A “fault is a deviation of a system from behavior described in its specification.” Fault causes an error by activation [Pecchia 2011]. For example, a defect is dormant in a software sub routine until that sub routine is not called.
  • Error: An error is part of the state that is incorrect. Errors make faults apparent. Not all errors are perceived by end users. Those errors that are perceived by users become faults. Errors cause failures. So, error belongs in the information domain, while a failure is typically perceived in the user domain [Pecchia 2011]. An error is often detected from the logs of IT systems. In a cloud native, containerized environment, a pod being not accessible is an error, but if the pod restarts on its own and there is no service delay, a user may not perceive a ‘pod not accessible error’ and, therefore, it may not turn into a fault.
  • Failure: A failure is the surfacing of an error to the user. A fault causes an error and an error causes failure. A failure belongs to an external universe. In IT Operations domain, the term failure is used to represent the failures that are experienced by the end users of IT systems.
  • Alert: An alert is a record of an event indicating a (fault) condition in the managed environment. It requires (or will require in the future) either human or automatic attention. For example, a disk drive failure event or network link down event can be raised as alerts. An alert is generated from an application or service and is delivered to a consumer of that alert. The consumer may be a person, an application or service, a tool (such as PagerDuty), a device (such as a cellphone or pager), etc. The alert would contain relevant information about the application or service to help the alert consumer take an action.
  • Incident: An incident is an unplanned interruption that causes, may cause, or reduces the quality of an IT Service [ITIL]. One or more alerts give an indication of an incident. An incident can be internal or external client-impacting. Typically, incidents require prompt attention. An unresponsive or unavailable business application is an example of an incident.
  • Problem/Issue: A problem is a cause or a potential cause of one or more recurring or similar incidents. It often indicates an issue at a deeper, systemic level. A problem is sometimes referred to as an issue. A problem is documented as a problem ticket in a service management tool.
  • Incident Ticket: A ticket is a formal record of an incident. When a track-worthy incident occurs, a ticket is created either by an IT operations person or automatically in a service desk or ticketing tool.
  • Customer Impacting Events (CIEs): Those incidents that impact paying customers are known as customer impacting events. The impact could be unavailability of the service or degradation of performance of the service. Customer impacting events are high-severity incidents that need immediate attention.
  • Root Cause: A root cause is the underlying cause of an incident. Often, fixing a root cause prevents the problem from recurring.  
  • Topology: An application and network topology refers to a map or a diagram that lays out the connections between different mission-critical applications in an enterprise. Topology can capture various relationships such as ‘depends on’, ‘manages’, ‘owns’, ‘realizes’, ‘deployedTo’ etc.
  • Runbook: Runbooks are standardized procedures for performing IT tasks. These could be documentation, scripts, step-by-step rules, etc. Runbooks can be manually executed or run automatically.


  1. ITIL: Root cause
  2. [Pecchia 2011] On the use of event logs for the analysis of system failures. 2011 Ph.D thesis dissertation submitted to University of Naples.
Was this article helpful?

More from Cloud

Enhance your data security posture with a no-code approach to application-level encryption

4 min read - Data is the lifeblood of every organization. As your organization’s data footprint expands across the clouds and between your own business lines to drive value, it is essential to secure data at all stages of the cloud adoption and throughout the data lifecycle. While there are different mechanisms available to encrypt data throughout its lifecycle (in transit, at rest and in use), application-level encryption (ALE) provides an additional layer of protection by encrypting data at its source. ALE can enhance…

Attention new clients: exciting financial incentives for VMware Cloud Foundation on IBM Cloud

4 min read - New client specials: Get up to 50% off when you commit to a 1- or 3-year term contract on new VCF-as-a-Service offerings, plus an additional value of up to USD 200K in credits through 30 June 2025 when you migrate your VMware workloads to IBM Cloud®.1 Low starting prices: On-demand VCF-as-a-Service deployments begin under USD 200 per month.2 The IBM Cloud benefit: See the potential for a 201%3 return on investment (ROI) over 3 years with reduced downtime, cost and…

The history of the central processing unit (CPU)

10 min read - The central processing unit (CPU) is the computer’s brain. It handles the assignment and processing of tasks, in addition to functions that make a computer run. There’s no way to overstate the importance of the CPU to computing. Virtually all computer systems contain, at the least, some type of basic CPU. Regardless of whether they’re used in personal computers (PCs), laptops, tablets, smartphones or even in supercomputers whose output is so strong it must be measured in floating-point operations per…

IBM Newsletters

Get our newsletters and topic updates that deliver the latest thought leadership and insights on emerging trends.
Subscribe now More newsletters