A Glossary for IT Operations Management

In this article, we define the terms that we use in Watson AIOps analytical pipelines, such as anomalies, alerts, incidents, faults, etc.

IBM Watson AIOps helps IT operations management personnel and site reliability engineers (SREs) detect anomalies early, predict them before they occur, reduce event/alert noise by grouping events/alerts related to the same incident, locate specific faults and failures in application or infrastructure components, determine the scope of incident impact, and recommend relevant and timely actions based on mining prior incident records and/or tickets.

All these analytics help reduce the mean time to detect an incident (MTTD) and the mean time to identify/isolate the cause of an incident (MTTI), and, thereby, the mean time to resolve an incident (MTTR). This, in turn, can save millions of dollars by preventing direct costs (lost revenue, penalties, opportunity costs, etc.) and indirect costs (customer dissatisfaction, lost customers, lost references, etc.). How does it do it? Read this article to find out.

Defining the terms

Log: A log is a data output from an application, service, server, or network driver that records a transaction or state occurring in it. Logs may or may not adhere to a format. A typical log has a minimum of two elements: time and a statement. The time is when the log was output from the application or service. The statement is a string that describes the transaction or state. Logs may have other elements, such as an error code, service/application name, protocol, etc. A sample application log looks something like this: {"utc_timestamp": "2020-02-12 23:16:49.377000", "message": "Resyncing ipsets with dataplane. Error invoking mmesh.EntityPredictionService/predictTokenizedDocument method on model"}. As you can see, it’s a combination of structured and unstructured data.
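
To make the two minimum elements of a log concrete, here is a minimal Python sketch that splits the sample log above into its time and its statement. The helper name parse_log and the timestamp format string are our assumptions, inferred from the single sample rather than from any published log schema.

```python
import json
from datetime import datetime

# Sample log line from the text, written with plain ASCII quotes.
RAW_LOG = (
    '{"utc_timestamp": "2020-02-12 23:16:49.377000", '
    '"message": "Resyncing ipsets with dataplane. Error invoking '
    'mmesh.EntityPredictionService/predictTokenizedDocument method on model"}'
)

def parse_log(raw):
    """Split one JSON log line into its two minimum elements: time and statement."""
    record = json.loads(raw)
    # Assumption: the timestamp format is inferred from the sample above.
    time = datetime.strptime(record["utc_timestamp"], "%Y-%m-%d %H:%M:%S.%f")
    statement = record["message"].strip()
    return time, statement

time, statement = parse_log(RAW_LOG)
print(time.isoformat(), "|", statement)
```

The time parses into a structured value, while the statement stays a free-text string, which is the mix of structured and unstructured data described above.
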
Metric: A metric is the measurement of a value of an aspect of an application or infrastructure component that is being monitored. In the hardware world, metrics would be based on things like CPUs, memory (RAM), disk I/O, network I/O, etc. Specific examples include the percentage of CPU or memory used (say, 95%). A typical metric has a minimum of two elements: time and value. The time is when the metric capture occurred, and the value is the measure of that metric at that time. There may be more elements. Note that the count of log lines of a kind (e.g., error messages) in a given time interval also constitutes a metric.
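
As an illustration of a metric derived from logs, the following sketch counts error lines per one-minute interval and returns (time, value) pairs. The log records, the one-minute window, and the function name error_count_per_minute are made up for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (time, statement) log records, invented for illustration.
LOGS = [
    (datetime(2020, 2, 12, 23, 16, 5), "Error invoking predict method"),
    (datetime(2020, 2, 12, 23, 16, 40), "Resyncing ipsets with dataplane"),
    (datetime(2020, 2, 12, 23, 16, 49), "Error invoking predict method"),
    (datetime(2020, 2, 12, 23, 17, 10), "Error invoking predict method"),
]

def error_count_per_minute(logs):
    """Return a metric as sorted (time, value) pairs: error lines per minute."""
    counts = Counter(
        time.replace(second=0, microsecond=0)  # truncate to the minute
        for time, statement in logs
        if "error" in statement.lower()
    )
    return sorted(counts.items())

metric = error_count_per_minute(LOGS)
for time, value in metric:
    print(time, value)
```

Each pair has exactly the two minimum elements of a metric: the time of the interval and the value measured for it.
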
Anomaly: An anomaly is a deviation from the expected baseline behavior. It is a state that occurs when the value of a metric (including those derived from logs, as noted before) is outside of the expected range. This is a binary state; the value is either anomalous or it is not. Anomalies can be those that require an IT operations person’s attention (aka persistent anomalies) or those that are self-resolving (aka transient anomalies) and therefore may not need to be raised as an alert (see the definition of an alert below). An anomaly could be a 500 error (internal server error) or simply a deviation from the normal distribution or pattern of metric values within a given time.
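
The binary nature of an anomaly, and the split between transient and persistent anomalies, can be sketched as follows. This is a simple fixed-range check, not the detection method used by Watson AIOps; the expected range (0 to 90), the rule that three or more consecutive anomalous samples count as persistent, and the CPU samples are all assumptions chosen for illustration.

```python
def flag_anomalies(values, low, high):
    """Binary state per sample: True if the value is outside [low, high]."""
    return [not (low <= v <= high) for v in values]

def classify_runs(flags, min_persistent=3):
    """Label each run of consecutive anomalous samples as transient or persistent."""
    runs, start = [], None
    for i, flag in enumerate(flags + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            length = i - start
            kind = "persistent" if length >= min_persistent else "transient"
            runs.append((start, length, kind))
            start = None
    return runs

# Hypothetical CPU % used samples, one per minute.
cpu = [40, 42, 97, 41, 43, 96, 98, 99, 97, 44]
flags = flag_anomalies(cpu, low=0, high=90)
runs = classify_runs(flags)
print(runs)
```

Under these assumptions, the single spike at index 2 resolves on its own, while the run starting at index 5 would be a candidate for an alert.
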
Event: An event is a change in the state of a service. It indicates that something noteworthy has happened. An event by itself may not mean that something bad has happened. An example of a noteworthy event that requires no action is ‘a container has moved to a new host/pod.’ An example of a noteworthy event that might require an action is ‘a disk drive failure.’ Such an event is, in fact, an alert. More details on alerts are below. Not all events are alerts, although all alerts could be events.
Fault: A fault is a defect in the internal state of a component in a system. A “fault is a deviation of a system from behavior described in its specification.” A fault causes an error when it is activated [Pecchia 2011]. For example, a defect in a software subroutine stays dormant until that subroutine is called.
Error: An error is the part of the system state that is incorrect. Errors make faults apparent. Not all errors are perceived by end users. Those errors that are perceived by users become failures. Errors cause failures. So, an error belongs in the information domain, while a failure is typically perceived in the user domain [Pecchia 2011]. An error is often detected from the logs of IT systems. In a cloud-native, containerized environment, a pod not being accessible is an error, but if the pod restarts on its own and there is no service delay, a user may not perceive a ‘pod not accessible’ error and, therefore, it may not turn into a failure.
Failure: A failure is the surfacing of an error to the user. A fault causes an error, and an error causes a failure. A failure belongs to the external universe. In the IT operations domain, the term failure is used to represent the failures that are experienced by the end users of IT systems.
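
The fault, error, and failure chain can be illustrated with a small, made-up example. The functions parse_port and get_port and the default port 8080 are hypothetical; the point is only to show a dormant fault, an error that is masked before it reaches the user, and the same error surfacing as a failure.

```python
def parse_port(text):
    # Fault: non-numeric input is not handled. The defect stays dormant
    # until this subroutine is called with such input.
    return int(text)

def get_port(text, default=8080):
    """Return (port, error_seen); the fallback keeps the error from the user."""
    try:
        return parse_port(text), False
    except ValueError:
        # Error: the fault was activated. It is visible here (and would
        # typically be logged), but the caller still gets a usable port.
        return default, True

print(get_port("443"))    # fault not activated
print(get_port("https"))  # error occurs, but it is masked: no failure

try:
    parse_port("https")   # no fallback: the error would reach the caller
except ValueError as exc:
    print("failure:", exc)
```

This mirrors the pod example above: an error that is absorbed internally stays in the information domain, and only an error that reaches the user counts as a failure.
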
Alert: An alert is a record of an event indicating a (fault) condition in the managed environment. It requires (or will require in the future) either human or automatic attention. For example, a disk drive failure event or network link down event can be raised as alerts. An alert is generated from an application or service and is delivered to a consumer of that alert. The consumer may be a person, an application or service, a tool (such as PagerDuty), a device (such as a cellphone or pager), etc. The alert would contain relevant information about the application or service to help the alert consumer take an action.
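
The relationship between events and alerts can be sketched with two small record types. The field names, the needs_attention flag, and the consumer value are assumptions made for this example; real systems typically derive the decision to raise an alert from rules or analytics rather than from a flag on the event.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    source: str            # application or service that emitted the event
    description: str
    needs_attention: bool  # hypothetical flag, for illustration only

@dataclass
class Alert:
    source: str
    description: str
    consumer: str          # person, tool, or device that receives the alert

def to_alert(event: Event, consumer: str) -> Optional[Alert]:
    """Raise an alert only for events that require attention."""
    if not event.needs_attention:
        return None
    return Alert(event.source, event.description, consumer)

events = [
    Event("orchestrator", "container moved to a new host", needs_attention=False),
    Event("storage-node-1", "disk drive failure", needs_attention=True),
]
alerts = [a for e in events if (a := to_alert(e, consumer="on-call SRE")) is not None]
print(alerts)
```

Only the disk drive failure becomes an alert, which matches the point that not all events are alerts.
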
Incident: An incident is an unplanned interruption to an IT service, or something that causes or may cause a reduction in the quality of an IT service [ITIL]. One or more alerts give an indication of an incident. An incident can be internal or can impact external clients. Typically, incidents require prompt attention. An unresponsive or unavailable business application is an example of an incident.
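
Because one or more alerts can indicate a single incident, grouping related alerts reduces noise. The sketch below uses the simplest possible rule, time proximity, and is not the grouping method used by Watson AIOps; the five-minute window, the function name group_alerts, and the sample alerts are assumptions.

```python
def group_alerts(alerts, window_seconds=300):
    """Group (timestamp_seconds, description) alerts by time proximity.

    An alert joins the current group if it arrives within window_seconds
    of the previous alert; otherwise it starts a new candidate incident.
    """
    groups = []
    for ts, desc in sorted(alerts):
        if groups and ts - groups[-1][-1][0] <= window_seconds:
            groups[-1].append((ts, desc))
        else:
            groups.append([(ts, desc)])
    return groups

# Hypothetical alerts: seconds since some reference time, plus a description.
alerts = [
    (0, "disk drive failure"),
    (120, "database latency high"),
    (200, "checkout application unresponsive"),
    (4000, "network link down"),
]
incidents = group_alerts(alerts)
for number, group in enumerate(incidents, start=1):
    print(f"candidate incident {number}: {[desc for _, desc in group]}")
```

In practice, grouping would usually also consider other signals, such as which components the alerts come from, rather than time alone.
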
Problem/Issue: A problem is a cause or a potential cause of one or more recurring or similar incidents. It often indicates an issue at a deeper, systemic level. A problem is sometimes referred to as an issue. A problem is documented as a problem ticket in a service management tool.
Incident Ticket: A ticket is a formal record of an incident. When a track-worthy incident occurs, a ticket is created either by an IT operations person or automatically in a service desk or ticketing tool.
Customer Impacting Events (CIEs): Those incidents that impact paying customers are known as customer impacting events. The impact could be unavailability of the service or degradation of performance of the service. Customer impacting events are high-severity incidents that need immediate attention.
Root Cause: A root cause is the underlying cause of an incident. Often, fixing a root cause prevents the problem from recurring.
Topology: An application and network topology refers to a map or a diagram that lays out the connections between different mission-critical applications in an enterprise. A topology can capture various relationships, such as ‘depends on’, ‘manages’, ‘owns’, ‘realizes’, ‘deployedTo’, etc.
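
A topology can be represented as a list of (source, relationship, target) triples, and walking it backward from a faulty component gives a rough scope of impact. The component names and the function impacted_by are hypothetical, and the sketch assumes that both ‘depends on’ and ‘deployedTo’ propagate impact; relationships such as ‘owns’ or ‘manages’ would likely need different handling.

```python
from collections import deque

# Hypothetical topology: (source, relationship, target) triples.
EDGES = [
    ("checkout-app", "depends on", "payment-service"),
    ("payment-service", "depends on", "orders-db"),
    ("orders-db", "deployedTo", "host-7"),
    ("catalog-app", "depends on", "catalog-db"),
]

def impacted_by(component, edges):
    """Return every component that reaches `component` through any relationship."""
    reverse = {}
    for source, _relationship, target in edges:
        reverse.setdefault(target, set()).add(source)
    seen, queue = set(), deque([component])
    while queue:  # breadth-first walk over reversed edges
        for parent in reverse.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

impacted = sorted(impacted_by("host-7", EDGES))
print(impacted)
```

Here a fault on host-7 reaches the database deployed to it and every application that depends on that database, while the unrelated catalog components are left out.
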
Runbook: Runbooks are standardized procedures for performing IT tasks. These could be documentation, scripts, step-by-step rules, etc. Runbooks can be manually executed or run automatically.

References

[ITIL] Root cause.
[Pecchia 2011] On the use of event logs for the analysis of system failures. Ph.D. dissertation, University of Naples, 2011.

Rama Akkiraju

IBM Fellow, CTO, AI for IT Operations

Amit Paradkar

Distinguished RSM, Cognitive Service Foundations

Isabell Sippli

IBM STSM, AIOps and Netcool

Kristian Stewart

IBM Distinguished Engineer, Watson AIOps