In this article, we define the terms that we use in Watson AIOps analytical pipelines, such as anomalies, alerts, incidents, faults, etc.

IBM Watson AIOps helps IT operations management personnel and site reliability engineers (SREs) detect anomalies early, predict them before they occur, reduce event/alert noise by grouping events/alerts related to the same incident, locate specific application or infrastructure component faults and failures, determine the scope of incident impact, and recommend relevant and timely actions by mining prior incident records and/or tickets.

All these analytics help reduce the mean time to detect an incident (MTTD) and the mean time to identify/isolate the cause of an incident (MTTI) and, thereby, the mean time to resolve an incident (MTTR). This, in turn, can save millions of dollars by avoiding direct costs (lost revenue, penalties, opportunity costs, etc.) and indirect costs (customer dissatisfaction, lost customers, lost references, etc.). How does it do it? Read this article to find out.

Defining the terms

  • Log: A log is a data output from an application, service, server, or network driver that records a transaction or state occurring in it. Logs may or may not adhere to a format. A typical log has a minimum of two elements: time and a statement. The time is when the log was output from the application or service. The statement is a string that describes the transaction or state. Logs may have other elements, such as an error code, service/application name, protocol, etc. A sample application log looks something like this: {"utc_timestamp": "2020-02-12 23:16:49.377000", "message": "Resyncing ipsets with dataplane. Error invoking mmesh.EntityPredictionService/predictTokenizedDocument method on model"}. As you can see, it’s a combination of structured and unstructured data.
  • Metric: A metric is the measurement of a value of an aspect of an application or infrastructure component that is being monitored. In the hardware world, metrics would be based on things like CPUs, memory (RAM), disk I/O, network I/O, etc. Specific examples include the percentage of CPU or memory in use (say, 95%). A typical metric has a minimum of two elements: time and value. The time is when the metric capture occurred, and the value is the measure of that metric at that time. There may be more elements. Note that the count of log lines of a kind (e.g., error messages) in a given time interval also constitutes a metric.
  • Anomaly: An anomaly is a deviation from the expected baseline behavior. It is a state that occurs when the value of a metric (including those derived from logs, as noted before) is outside of the expected range. This is a binary state; the value is either anomalous or it is not. Anomalies can be those that require an IT operations person’s attention (aka persistent anomalies) or those that are self-resolving (aka transient) and therefore may not need to be raised up as an alert (please see definition of an alert below). An anomaly could be a 500 error (internal server error) or simply a deviation from the normal distribution or pattern of metric values within a given time.
  • Event: An event is a change in the state of a service. It indicates that something noteworthy has happened. An event by itself may not mean that something bad has happened. An example of a noteworthy event that requires no action is ‘a container has moved to a new host/pod.’ An example of a noteworthy event that might require an action is ‘a disk drive failure.’ This, in fact, is an alert. More details on alerts are below. Not all events are alerts. All alerts could be events.
  • Fault: A fault is a defect in the internal state of a component in a system. A “fault is a deviation of a system from behavior described in its specification.” A fault causes an error when it is activated [Pecchia 2011]. For example, a defect in a software subroutine stays dormant until that subroutine is called.
  • Error: An error is the part of the system state that is incorrect. Errors make faults apparent. Not all errors are perceived by end users. Those errors that are perceived by users become failures. Errors cause failures. So, an error belongs in the information domain, while a failure is typically perceived in the user domain [Pecchia 2011]. An error is often detected from the logs of IT systems. In a cloud native, containerized environment, a pod not being accessible is an error, but if the pod restarts on its own and there is no service delay, a user may not perceive a ‘pod not accessible’ error and, therefore, it may not turn into a failure.
  • Failure: A failure is the surfacing of an error to the user. A fault causes an error, and an error causes a failure. A failure belongs to the external universe. In the IT operations domain, the term failure is used to represent the failures that are experienced by the end users of IT systems.
  • Alert: An alert is a record of an event indicating a (fault) condition in the managed environment. It requires (or will require in the future) either human or automatic attention. For example, a disk drive failure event or network link down event can be raised as alerts. An alert is generated from an application or service and is delivered to a consumer of that alert. The consumer may be a person, an application or service, a tool (such as PagerDuty), a device (such as a cellphone or pager), etc. The alert would contain relevant information about the application or service to help the alert consumer take an action.
  • Incident: An incident is an unplanned interruption to an IT service, or something that reduces, or may reduce, the quality of an IT service [ITIL]. One or more alerts give an indication of an incident. An incident can be internal or external client-impacting. Typically, incidents require prompt attention. An unresponsive or unavailable business application is an example of an incident.
  • Problem/Issue: A problem is a cause or a potential cause of one or more recurring or similar incidents. It often indicates an issue at a deeper, systemic level. A problem is sometimes referred to as an issue. A problem is documented as a problem ticket in a service management tool.
  • Incident Ticket: A ticket is a formal record of an incident. When a track-worthy incident occurs, a ticket is created either by an IT operations person or automatically in a service desk or ticketing tool.
  • Customer Impacting Events (CIEs): Those incidents that impact paying customers are known as customer impacting events. The impact could be unavailability of the service or degradation of performance of the service. Customer impacting events are high-severity incidents that need immediate attention.
  • Root Cause: A root cause is the underlying cause of an incident. Often, fixing a root cause prevents the problem from recurring.  
  • Topology: An application and network topology refers to a map or a diagram that lays out the connections between different mission-critical applications in an enterprise. Topology can capture various relationships such as ‘depends on’, ‘manages’, ‘owns’, ‘realizes’, ‘deployedTo’ etc.
  • Runbook: Runbooks are standardized procedures for performing IT tasks. These could be documentation, scripts, step-by-step rules, etc. Runbooks can be manually executed or run automatically.
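
The log, metric, and anomaly definitions above can be illustrated with a minimal Python sketch. It is not how Watson AIOps implements its pipelines. The function names, the sample log lines, the expected range of 0 to 2 errors per minute, and the rule that two consecutive anomalous intervals count as a persistent anomaly are all assumptions made for illustration.

```python
import json
from collections import Counter
from datetime import datetime

def make_log(ts, message):
    """Build a hypothetical JSON log line shaped like the sample log above."""
    return json.dumps({"utc_timestamp": ts, "message": message})

ERR = "Error invoking predict method on model"   # made-up error statement
INFO = "Resyncing ipsets with dataplane"         # made-up informational statement

RAW_LOGS = (
    [make_log("2020-02-12 23:16:49.377000", ERR)]
    + [make_log("2020-02-12 23:16:50.000000", INFO)]
    + [make_log(f"2020-02-12 23:17:{s:02d}.000000", ERR) for s in (5, 20, 40)]
    + [make_log(f"2020-02-12 23:18:{s:02d}.000000", ERR) for s in (1, 15, 30, 45)]
    + [make_log("2020-02-12 23:19:10.000000", INFO)]
)

def parse_log(line):
    """Split a log line into its two minimum elements: time and statement."""
    record = json.loads(line)
    ts = datetime.strptime(record["utc_timestamp"].strip(), "%Y-%m-%d %H:%M:%S.%f")
    return ts, record["message"].strip()

def error_count_metric(lines, keyword="error"):
    """Derive a metric (time, value): matching log lines counted per minute."""
    counts = Counter()
    for line in lines:
        ts, message = parse_log(line)
        minute = ts.replace(second=0, microsecond=0)
        # Adding 0 still registers the minute, so quiet intervals report a value of 0.
        counts[minute] += int(keyword in message.lower())
    return sorted(counts.items())

def flag_anomalies(metric, low, high):
    """Binary anomaly state: True when the value is outside the range [low, high]."""
    return [(ts, value, not (low <= value <= high)) for ts, value in metric]

def is_persistent(flags, min_consecutive=2):
    """Assumed rule: a run of min_consecutive anomalous intervals is persistent."""
    run = 0
    for _, _, anomalous in flags:
        run = run + 1 if anomalous else 0
        if run >= min_consecutive:
            return True
    return False

metric = error_count_metric(RAW_LOGS)
flags = flag_anomalies(metric, low=0, high=2)
for ts, value, anomalous in flags:
    print(ts, value, "anomalous" if anomalous else "normal")
print("persistent:", is_persistent(flags))
```

In this sketch, a persistent anomaly would be a candidate for raising an alert, while a single anomalous interval that returns to the expected range would be treated as transient. A real pipeline would learn the expected range from baseline behavior rather than hard-coding it.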

References

  1. ITIL: Root cause
  2. [Pecchia 2011] On the use of event logs for the analysis of system failures. Ph.D. dissertation, University of Naples, 2011.