About log anomaly detection algorithms

IBM Cloud Pak® for AIOps processes incoming logs as part of the log anomaly detection process.

There are two log anomaly detection AI algorithms, each of which can run independently of the other. If both algorithms are enabled, any log anomaly that is discovered by both is reconciled so that only one alert is generated. In this case, the severity of the combined alert is the higher of the two alert severities.
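Conceptually, the reconciliation step amounts to merging matching anomalies and keeping the higher severity. The following sketch illustrates this idea; the field names (`signature`, `severity`) and the numeric severity scale are assumptions for the example, not the product's actual alert schema.

```python
# Illustrative sketch of reconciling anomalies found by both algorithms.
# Field names and the severity scale are assumptions, not the product schema.

def reconcile(anomalies_a, anomalies_b):
    """Merge anomalies that refer to the same log pattern, keeping
    the highest severity, so that only one alert is generated."""
    merged = {a["signature"]: dict(a) for a in anomalies_a}
    for b in anomalies_b:
        key = b["signature"]
        if key in merged:
            # Same anomaly found by both algorithms: keep one alert
            # with the higher of the two severities.
            merged[key]["severity"] = max(merged[key]["severity"], b["severity"])
        else:
            merged[key] = dict(b)
    return list(merged.values())

alerts = reconcile(
    [{"signature": "oom-killer", "severity": 4}],
    [{"signature": "oom-killer", "severity": 6},
     {"signature": "disk-latency", "severity": 3}],
)
```

Here the anomaly that both algorithms found ("oom-killer") produces a single alert with the higher severity, while the anomaly found by only one algorithm passes through unchanged.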

Site reliability engineers (SREs) and other users responsible for application and service availability can view log anomalies as alerts within the context of an incident, as described in Viewing relevant events. They can also view these log anomaly alerts in the Alert Viewer, as described in About alerts.

Log anomaly detection algorithm
Figure. Log anomaly detection algorithm

The preceding diagram highlights the machine learning models that are used within Cloud Pak for AIOps. The models utilize advanced algorithms to analyze data from various sources, including logs, metrics, topology, and alerts. The analyzed data is used to detect anomalies, predict changes, and provide real-time insights.

The primary goal of these Cloud Pak for AIOps machine learning models is to enable IT teams to proactively identify and resolve issues before those issues impact the business. Advanced language models, deep NLP, and semantic parsing enable IT teams to extract and summarize incident information quickly and accurately. By identifying patterns and trends, the machine learning models enable IT teams to quickly determine the root cause of issues, take corrective action to prevent downtime, and help ensure optimal system performance and security, ultimately improving the efficiency, reliability, and security of IT operations.

With Cloud Pak for AIOps policies, IT operators can automatically assign runbooks to events based on pre-configured conditions. They can also identify the root cause and scope of affected components using blast radius and fault localization. Additionally, operators can execute assigned runbooks automatically or manually using operator-configurable policies.
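A policy that assigns a runbook based on pre-configured conditions can be pictured as a simple match-and-assign rule. The following sketch is a simplified illustration only; the policy structure, field names, and values are assumptions for the example, not Cloud Pak for AIOps policy syntax.

```python
# Simplified illustration of a policy that assigns a runbook to events
# matching pre-configured conditions. The policy structure and field
# names are assumptions for this sketch, not the product's policy format.

policy = {
    "conditions": {"resource": "payments-db", "severity_at_least": 5},
    "runbook": "restart-database",
    "execution": "manual",  # operator-configurable: "automatic" or "manual"
}

def matches(event, conditions):
    """Check whether an event satisfies the policy's conditions."""
    return (event.get("resource") == conditions["resource"]
            and event.get("severity", 0) >= conditions["severity_at_least"])

def apply_policy(event, policy):
    """Return the runbook assignment for an event, or None if no match."""
    if matches(event, policy["conditions"]):
        return {"runbook": policy["runbook"], "execution": policy["execution"]}
    return None

assignment = apply_policy({"resource": "payments-db", "severity": 6}, policy)
```

In this sketch, a high-severity event on the matching resource receives the runbook assignment, and the `execution` field captures whether the assigned runbook runs automatically or waits for an operator.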

By automating routine tasks and providing real-time visibility into system performance, Cloud Pak for AIOps policies enable teams to streamline workflows, enhance incident response, reduce downtime, and boost overall efficiency.

Prerequisites

Log data

Logs are sent from an application environment to a logging system, such as Mezmo or Splunk, through data shippers, such as Logstash. Each logging system has a mapping definition in Cloud Pak for AIOps.
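A mapping definition tells the platform which fields in a source log record correspond to the normalized fields that it expects. The following sketch illustrates the idea; the source field names (`_ts`, `_line`, `_host`) and target field names are illustrative assumptions, not a real product mapping.

```python
# Hypothetical sketch of a mapping definition: each entry maps a
# normalized target field to the field name used by the source logging
# system. All field names here are illustrative assumptions.

mapping = {
    "timestamp": "_ts",      # source field holding the event time
    "message": "_line",      # source field holding the raw log line
    "instance_id": "_host",  # source field identifying the emitter
}

def normalize(record, mapping):
    """Apply a mapping definition to one raw log record."""
    return {target: record.get(source) for target, source in mapping.items()}

normalized = normalize(
    {"_ts": 1700000000, "_line": "ERROR: timeout", "_host": "node-1"},
    mapping,
)
```

Because each logging system names its fields differently, a per-system mapping like this lets downstream processing work against one consistent record shape.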

The log anomaly detection algorithms require log data from one or more of your log systems. Before attempting to set up training for these algorithms, ensure that you have created a functioning data integration to at least one of the log systems listed in the following table.

Table. Log system details
Log system For more details, see:
Falcon LogScale Creating Falcon LogScale integrations
Mezmo Creating Mezmo integrations
Splunk Creating Splunk integrations

You can also load log data into the system by using one of the following generic data loading methods.

Table. Data loading method details
Data loading method For more details, see:
Custom Creating custom integrations
ELK Creating ELK integrations
Kafka Creating Kafka integrations

Cloud Pak for AIOps provides integrations with log aggregators such as ELK (Elasticsearch, Logstash, Kibana), Falcon LogScale, Mezmo, and Splunk, and includes a REST service for custom integrations. These integrations connect to the source and pull the log data. If you do not have a log aggregator that is compatible with Cloud Pak for AIOps, you can use the Kafka-based integration instead. The Kafka-based integration also supports log processing where historic logs are fed in offline mode. Allocate additional time for planning and development in this scenario, because custom logic might be needed to parse and transform the logs into the format that Cloud Pak for AIOps expects.
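The custom parse-and-transform logic mentioned above often amounts to converting each raw log line into a structured JSON record before it is published to the Kafka topic. A minimal sketch follows, assuming a syslog-like input line format; the output field names are illustrative assumptions, not the schema that Cloud Pak for AIOps expects.

```python
import json
import re

# Minimal sketch of custom parse-and-transform logic for a Kafka-based
# integration: convert a raw, syslog-like line into a structured JSON
# record. The input line format and output field names are assumptions.

LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<host>\S+) (?P<level>\w+) (?P<msg>.*)$")

def transform(raw_line):
    """Parse one raw log line into a JSON string, or None if unparsable."""
    m = LINE_RE.match(raw_line)
    if not m:
        return None
    return json.dumps({
        "timestamp": m.group("ts"),
        "instance_id": m.group("host"),
        "level": m.group("level"),
        "message": m.group("msg"),
    })

record = transform("2024-05-01T12:00:00Z node-7 ERROR connection refused")
```

In a real pipeline, a function like this would run for each consumed or replayed line, with unparsable lines routed to a dead-letter path rather than silently dropped.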

Performance tuning

Before training can occur, logs are pulled from a logging system, such as Mezmo or Splunk, and are stored in Elasticsearch.

Before attempting to set up training for these algorithms, review the performance tuning tasks at the following location: Setting up training for log anomaly detection.

Language support

For information about supported languages for this algorithm, see Language support.

Algorithms

The algorithms are as follows:

Best practices

Log anomaly detection algorithms are most useful when you have logs that represent a healthy system. When Cloud Pak for AIOps is trained on a healthy system, it can then detect anomalous problems. For instance, if the infrastructure or application logs are accurate, real-time representations of system health, then log anomaly alerts can help to minimize the mean time to detection, including the early warning signs of problems that might occur later.

Log anomaly detection is resource-intensive, and each log anomaly detection algorithm has different data requirements. To avoid a significant increase in hardware requirements, limit log analysis to the infrastructure and applications that you consider business critical.