In this article, we define the terms that we use in Watson AIOps analytical pipelines, such as anomalies, alerts, incidents, faults, etc.

IBM Watson AIOps helps IT operations management personnel and site reliability engineers (SREs) detect anomalies early or predict them before they occur, reduce event and alert noise by grouping events and alerts related to the same incident, locate the specific application or infrastructure component faults and failures, determine the scope of an incident’s impact, and recommend relevant and timely actions by mining prior incident records and/or tickets.

Together, these analytics help reduce the mean time to detect an incident (MTTD) and the mean time to identify or isolate its cause (MTTI) and, thereby, the mean time to resolve the incident (MTTR). This, in turn, can save millions of dollars by avoiding direct costs (lost revenue, penalties, opportunity costs, etc.) and indirect costs (customer dissatisfaction, lost customers, lost references, etc.). How does it do it? Read this article to find out.

Defining the terms

  • Log: A log is a data output from an application, service, server, or network driver that records a transaction or state occurring in it. Logs may or may not adhere to a format. A typical log has a minimum of two elements: time and a statement. The time is when the log was output from the application or service. The statement is a string that describes the transaction or state. Logs may have other elements, such as an error code, service/application name, protocol, etc. A sample application log looks something like this: {"utc_timestamp": "2020-02-12 23:16:49.377000", "message": "Resyncing ipsets with dataplane. Error invoking mmesh.EntityPredictionService/predictTokenizedDocument method on model"}. As you can see, it’s a combination of structured and unstructured data.
  • Metric: A metric is a measurement of some aspect of an application or infrastructure component that is being monitored. In the hardware world, metrics are based on things like CPU, memory (RAM), disk I/O, network I/O, etc. A specific example is the percentage of CPU or memory in use (say, 95%). A typical metric has a minimum of two elements: time and value. The time is when the metric was captured, and the value is the measure of that metric at that time. There may be more elements. Note that the count of log lines of a given kind (e.g., error messages) in a given time interval also constitutes a metric.
  • Anomaly: An anomaly is a deviation from the expected baseline behavior. It is a state that occurs when the value of a metric (including those derived from logs, as noted before) is outside of the expected range. This is a binary state; the value is either anomalous or it is not. Anomalies can be persistent, requiring an IT operations person’s attention, or transient, resolving on their own and therefore not necessarily needing to be raised as an alert (see the definition of an alert below). An anomaly could be a 500 error (internal server error) or simply a deviation from the normal distribution or pattern of metric values within a given time.
  • Event: An event is a change in the state of a service. It indicates that something noteworthy has happened. An event by itself may not mean that something bad has happened. An example of a noteworthy event that requires no action is ‘a container has moved to a new host/pod.’ An example of a noteworthy event that might require action is ‘a disk drive failure.’ Such an event is, in fact, an alert; more details on alerts are below. Not all events are alerts, although every alert can be regarded as an event.
  • Fault: A fault is a defect in the internal state of a component in a system. A “fault is a deviation of a system from behavior described in its specification.” A fault causes an error when it is activated [Pecchia 2011]. For example, a defect in a software subroutine is dormant until that subroutine is called.
  • Error: An error is the part of the system state that is incorrect. Errors make faults apparent. Not all errors are perceived by end users; those that are perceived by users become failures. Errors cause failures. So, an error belongs in the information domain, while a failure is typically perceived in the user domain [Pecchia 2011]. An error is often detected from the logs of IT systems. In a cloud-native, containerized environment, a pod that is not accessible is an error, but if the pod restarts on its own and there is no service delay, a user may not perceive the ‘pod not accessible’ error and, therefore, it may not turn into a failure.
  • Failure: A failure is the surfacing of an error to the user. A fault causes an error, and an error causes a failure. A failure belongs to the external, user-facing domain. In the IT operations domain, the term failure is used to represent the failures that are experienced by the end users of IT systems.
  • Alert: An alert is a record of an event indicating a (fault) condition in the managed environment. It requires (or will require in the future) either human or automatic attention. For example, a disk drive failure event or network link down event can be raised as alerts. An alert is generated from an application or service and is delivered to a consumer of that alert. The consumer may be a person, an application or service, a tool (such as PagerDuty), a device (such as a cellphone or pager), etc. The alert would contain relevant information about the application or service to help the alert consumer take an action.
  • Incident: An incident is an unplanned interruption to an IT service, or something that reduces or may reduce the quality of an IT service [ITIL]. One or more alerts give an indication of an incident. An incident can be internal or external client-impacting. Typically, incidents require prompt attention. An unresponsive or unavailable business application is an example of an incident.
  • Problem/Issue: A problem is a cause or a potential cause of one or more recurring or similar incidents. It often indicates an issue at a deeper, systemic level. A problem is sometimes referred to as an issue. A problem is documented as a problem ticket in a service management tool.
  • Incident Ticket: A ticket is a formal record of an incident. When a track-worthy incident occurs, a ticket is created either by an IT operations person or automatically in a service desk or ticketing tool.
  • Customer Impacting Events (CIEs): Those incidents that impact paying customers are known as customer impacting events. The impact could be unavailability of the service or degradation of performance of the service. Customer impacting events are high-severity incidents that need immediate attention.
  • Root Cause: A root cause is the underlying cause of an incident. Often, fixing a root cause prevents the problem from recurring.  
  • Topology: An application and network topology is a map or diagram that lays out the connections between different mission-critical applications in an enterprise. A topology can capture various relationships, such as ‘depends on’, ‘manages’, ‘owns’, ‘realizes’, ‘deployedTo’, etc.
  • Runbook: Runbooks are standardized procedures for performing IT tasks. These could be documentation, scripts, step-by-step rules, etc. Runbooks can be manually executed or run automatically.
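
The log, metric, and anomaly definitions above can be tied together in a short sketch. It is illustrative only and is not how Watson AIOps implements its pipelines: the function names, the fixed expected range, and the rule that three consecutive anomalous points count as "persistent" are all assumptions made for this example. The log lines follow the JSON shape of the sample log above, with shortened, hypothetical messages.

```python
import json
from collections import Counter
from datetime import datetime


def parse_log(line):
    """Split one JSON log line into its two minimum elements: time and statement."""
    record = json.loads(line)
    when = datetime.strptime(record["utc_timestamp"], "%Y-%m-%d %H:%M:%S.%f")
    return when, record["message"]


def errors_per_minute(lines):
    """Derive a metric from logs: the count of error lines in each minute."""
    counts = Counter()
    for line in lines:
        when, message = parse_log(line)
        if "error" in message.lower():
            counts[when.replace(second=0, microsecond=0)] += 1
    return sorted(counts.items())  # list of (time, value) pairs


def is_anomalous(value, low, high):
    """Binary state: the value is either outside the expected range or it is not."""
    return value < low or value > high


def label_anomalies(values, low, high, persist_after=3):
    """Label each point 'normal', 'transient', or 'persistent'.

    Assumption for this sketch: a run of at least `persist_after`
    consecutive anomalous points is persistent; shorter runs are transient.
    """
    labels = ["normal"] * len(values)
    start = None
    # A trailing None acts as a sentinel that closes a run ending at the last point.
    for i, value in enumerate(list(values) + [None]):
        anomalous = value is not None and is_anomalous(value, low, high)
        if anomalous and start is None:
            start = i
        elif not anomalous and start is not None:
            kind = "persistent" if i - start >= persist_after else "transient"
            labels[start:i] = [kind] * (i - start)
            start = None
    return labels


# Hypothetical log lines in the shape of the sample log.
logs = [
    '{"utc_timestamp": "2020-02-12 23:16:49.377000", "message": "Error invoking predict method on model"}',
    '{"utc_timestamp": "2020-02-12 23:16:51.020000", "message": "Resyncing ipsets with dataplane"}',
    '{"utc_timestamp": "2020-02-12 23:17:03.500000", "message": "Error invoking predict method on model"}',
    '{"utc_timestamp": "2020-02-12 23:17:40.000000", "message": "Error: connection reset"}',
]
metric = errors_per_minute(logs)

# Hypothetical CPU % used samples, with an assumed expected range of 0 to 90.
cpu_percent = [40, 97, 42, 96, 98, 99, 45]
labels = label_anomalies(cpu_percent, low=0, high=90)
print(metric)
print(labels)
```

In this sketch only the run of three consecutive high CPU readings is labeled persistent and would be a candidate for an alert; the single spike is labeled transient.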

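The chain from fault to error to failure described in the definitions above can also be shown with a toy example. This is a minimal sketch under stated assumptions: `average_latency`, `handle_request`, and the `mask_errors` flag are hypothetical names invented for illustration, and the fallback response stands in for a self-healing mechanism such as a pod restart.

```python
def average_latency(samples):
    # Fault: there is no guard for an empty list. The defect stays dormant
    # until the function is called with no samples.
    return sum(samples) / len(samples)


def handle_request(samples, mask_errors):
    """Return (response, error_log) for one hypothetical user request."""
    error_log = []
    try:
        return f"avg={average_latency(samples):.1f}", error_log
    except ZeroDivisionError as exc:
        # Error: the fault was activated; the incorrect state shows up in the log.
        error_log.append(f"error: {exc}")
        if mask_errors:
            # The error is handled internally, so the user sees a normal response.
            return "avg=n/a", error_log
        # Failure: the error surfaces to the user.
        return "HTTP 500", error_log


ok_response, ok_log = handle_request([10, 20], mask_errors=False)
masked_response, masked_log = handle_request([], mask_errors=True)
failed_response, failed_log = handle_request([], mask_errors=False)
print(ok_response, masked_response, failed_response)
```

The first call never activates the fault, the second produces an error that is visible only in the log, and the third produces the same error but lets it reach the user as a failure.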

  1. [ITIL] ITIL: Root cause
  2. [Pecchia 2011] On the use of event logs for the analysis of system failures. Ph.D. thesis, University of Naples, 2011.