About metric anomaly detection

This AI algorithm generates alerts when it detects anomalous behavior in your metrics.

Description

Metric anomaly detection is composed of a set of unsupervised learning algorithms. These algorithms learn normal patterns of metric behavior by analyzing metric values at regular intervals. Then, metric anomaly detection raises anomalies or alerts when that behavior significantly changes.

Enabling metric anomaly detection helps IBM Cloud Pak® for AIOps discover metric anomalies. The first model, simple baseline, is produced after 4 data points. Simple baseline is also known as the naive baseline.

Unless otherwise stated, other models are produced when at least 7 days of data are present in the system, and training is completed. The algorithm can use up to 14 days of data to learn. After models are trained, you can be alerted to problems before services or applications are impacted.

Metric anomalies are generated based on the following analyses:

  • Dynamic calculation of baseline values for each metric. For example, a metric baseline might be within the 1 to 25 range at a particular time of day. If a metric value is returned outside of this range, then a metric anomaly alert is immediately generated. A minimal sketch after this list illustrates this check.
  • Simple baseline, where less than 50% of data is available for a time series when training occurs, or the analytics determines that a dynamic baseline isn't a good fit for a particular time series. This baseline starts to be produced after 4 data points. After the analytics training occurs, this algorithm might raise alerts when values exist outside the simple baseline. Before the analytics training occurs, values can fall outside the simple baseline and not raise an alert.
  • Flatlining, whereby the system identifies that the metric started unexpectedly returning a constant value. Models are produced when at least 3.5 days of data are present in the system, and training is completed. The algorithm can use up to 7 days of data to learn.
  • Finite Domain, whereby the system detects an anomaly when a metric value rises to a level that it has not previously reached.
  • Predominant Range, whereby the system detects an anomaly when the variation in a metric value exceeds the range within which the metric normally varies.
  • A metric whose variance has been learned, and that is later found to vary significantly beyond that learned variance, is also flagged with a metric anomaly alert.
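
For illustration, the following minimal Python sketch shows the dynamic baseline check described in the first item of this list. The baseline bounds, metric values, and helper function are hypothetical assumptions for illustration, not Cloud Pak for AIOps internals.

    # Illustrative sketch only: flags a value that falls outside a learned
    # baseline range for a given time of day. The bounds and names here are
    # hypothetical, not Cloud Pak for AIOps internals.
    from datetime import datetime

    # Hypothetical learned baseline: (lower, upper) bounds keyed by hour of day.
    learned_baseline = {hour: (1.0, 25.0) for hour in range(24)}

    def is_anomalous(metric_value: float, timestamp: datetime) -> bool:
        """Return True if the value falls outside the baseline for that hour."""
        lower, upper = learned_baseline[timestamp.hour]
        return metric_value < lower or metric_value > upper

    # A value of 42 at 10:00 is outside the 1 to 25 baseline, so it would be
    # flagged as a metric anomaly in this simplified model.
    print(is_anomalous(42.0, datetime(2024, 5, 1, 10, 0)))  # True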

Site reliability engineers (SREs) and other users responsible for application and service availability can view metric anomalies as alerts within the context of an incident, as described in the related topics.

They can also view metric anomaly alerts in the Alert Viewer, along with associated graphs that display metrics over time, baseline values, highlighting of anomalous metrics, and forecasting of metric values, as described in the related topics.

Note: Individual anomalous or non-anomalous metrics can be viewed on a time series chart by using the Metric search function in the Cloud Pak for AIOps console.

Prerequisites

Metric anomaly detection must be trained before anomalies are generated. For more information about training metric anomaly detection based on time series from integrated data sources, see Setting up training for metric anomaly detection.

To train metric anomaly detection, you need time series data that takes measurements at regular intervals.

  • Up to 14 days of the most recent time series data are used to build the models.
  • Models require data for at least 50% of their training window before they are built and used to detect anomalies. For more information, see Description. A worked example follows this list.
  • Models that do not satisfy the 50% data availability requirement might still raise anomalies with the simple baseline. Anomalies from the simple baseline are always produced, and the simple baseline is trained with a minimum of 4 data points.
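
As a worked example of the 50% requirement, the following sketch assumes a hypothetical collection interval of 5 minutes (the actual interval depends on your data source) and computes how many data points a full 14-day training window would need.

    # Worked example of the 50% data-availability requirement, assuming a
    # hypothetical 5-minute collection interval; the real interval depends on
    # your metric source and integration configuration.
    TRAINING_WINDOW_DAYS = 14     # up to 14 days of data are used for training
    INTERVAL_MINUTES = 5          # assumed collection interval

    points_per_day = 24 * 60 // INTERVAL_MINUTES            # 288 points per day
    window_points = TRAINING_WINDOW_DAYS * points_per_day   # 4032 possible points
    required_points = window_points // 2                    # 50% = 2016 points

    print(f"A full {TRAINING_WINDOW_DAYS}-day window holds {window_points} points;")
    print(f"at least {required_points} are needed before the model is built.")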

Data ingestion

Data comes from one of the following sources:

  • Ingest metric data by posting it using the REST API. For more information, see Metrics API and the POST entry in the associated Swagger document. A sketch of this flow follows this list.
  • Verify that data was ingested by retrieving metric data using the REST API. For more information, see Metrics API and the GET entry in the associated Swagger document.
  • If you have a metric data integration set up (Instana, Dynatrace, New Relic, Zabbix, for example), you can ingest data using the relevant integration, as described in Integration.
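
For illustration, the following Python sketch shows the general shape of pushing metric data with a POST request and reading it back with a GET request. The base URL, payload fields, and authentication shown here are assumptions only; refer to the Metrics API topic and the associated Swagger document for the actual endpoint and schema.

    # Illustrative only: the endpoint path, payload fields, and token below are
    # assumptions; consult the Metrics API Swagger document for the actual
    # schema and URL exposed by your Cloud Pak for AIOps deployment.
    import requests

    BASE_URL = "https://<cluster-host>/metrics-api"    # hypothetical base URL
    HEADERS = {
        "Authorization": "Bearer <api-token>",         # hypothetical API token
        "Content-Type": "application/json",
    }

    # Hypothetical payload: one metric value for one resource at one timestamp.
    payload = {
        "resource": "app-server-01",
        "metric": "cpu_utilization",
        "timestamp": "2024-05-01T10:00:00Z",
        "value": 42.0,
    }

    # POST: ingest a metric data point.
    post_response = requests.post(f"{BASE_URL}/metrics", json=payload, headers=HEADERS)
    print(post_response.status_code)

    # GET: read metric data back to confirm that ingestion succeeded.
    get_response = requests.get(f"{BASE_URL}/metrics", headers=HEADERS)
    print(get_response.json())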

Automating metric anomalies to create incidents

You can automatically create incidents from a subset of anomalies, for example, to allow for better prioritization. To do this, create a policy that promotes alerts to an incident, just as you would for any other alert. For more information, see Promote alerts to an incident.

The following example shows a policy to promote alerts to an incident:

Figure: Policy template for promoting alerts to incidents

This policy means that the next time a new alert is raised on resource1 or resource2 for metric1, metric2, or any metric in group3, an incident is automatically created for it.
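
For clarity, the following sketch restates the matching logic of that example policy. Policies are configured in the Cloud Pak for AIOps console rather than in code, and the field names used here are hypothetical.

    # Conceptual restatement of the example policy condition; this is not how
    # Cloud Pak for AIOps evaluates policies, and the field names are hypothetical.
    MATCH_RESOURCES = {"resource1", "resource2"}
    MATCH_METRICS = {"metric1", "metric2"}
    MATCH_GROUPS = {"group3"}

    def should_promote(alert: dict) -> bool:
        """Return True if a new alert matches the example policy conditions."""
        on_resource = alert.get("resource") in MATCH_RESOURCES
        on_metric = (alert.get("metric") in MATCH_METRICS
                     or alert.get("metric_group") in MATCH_GROUPS)
        return on_resource and on_metric

    # A new anomaly alert on resource1 for metric2 matches, so an incident
    # would be created automatically.
    print(should_promote({"resource": "resource1", "metric": "metric2"}))  # True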

For more information, see Promote alerts to an incident.

Language support

For information about supported languages for this algorithm, see Language support.