About log anomaly detection - golden signals

Detect log anomalies with golden signals.

Log anomaly detection - golden signals, introduced in Version 4.7.0, is an algorithm that improves signal-to-noise management of anomaly detection over previous log anomaly detection algorithms. It uses autoclassification of log templates to pre-filter logs of interest and provides administrative control to further tune the anomalies that are raised as alerts. It also improves anomaly alert explainability by providing anomaly periods, baselines, and forecasts in a chart and access to raw logs.

How do golden signals help with log anomaly detection?

Working with metric anomaly detection, Log anomaly detection - golden signals discovers important log message patterns and detects when they change significantly. The algorithm takes in raw log messages and converts the textual information to log patterns with both known and unknown variables and static content. These log patterns are called templates. Domain knowledge is used to assign a golden signal type to each template.

After this templatization, log data is filtered based on the golden signal type and converted to metric data. This converted metric data is passed through the metric anomaly detection algorithm to train models and infer anomalies.
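
The flow from raw log lines to metric data can be pictured with a minimal sketch. It is an illustration only, not the product's implementation: the masking rules, the placeholder names `<<IP>>` and `<<NUMBER>>`, the 5-minute window, and the `keep` filter are all assumptions made for this example.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical masking rules; the product's real variable rules are not shown in this topic.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<<IP>>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<<NUMBER>>"),
]

def to_template(message: str) -> str:
    """Replace variable-looking tokens with placeholders and keep the static text."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message

def to_metric(logs, window_minutes=5, keep=lambda template: True):
    """Count log lines per (window start, template); these counts are the metric data."""
    counts = Counter()
    for timestamp, message in logs:
        template = to_template(message)
        if not keep(template):  # stand-in for filtering by golden signal type
            continue
        minute = timestamp.minute - timestamp.minute % window_minutes
        window = timestamp.replace(minute=minute, second=0, microsecond=0)
        counts[(window, template)] += 1
    return counts

logs = [
    (datetime(2024, 1, 1, 10, 1), "Request to 10.0.0.5 failed after 30 ms"),
    (datetime(2024, 1, 1, 10, 3), "Request to 10.0.0.9 failed after 45 ms"),
    (datetime(2024, 1, 1, 10, 7), "Request to 10.0.0.5 failed after 12 ms"),
]
metric = to_metric(logs)
```

In this sketch, the three sample lines collapse into one template, and the per-window counts form the time series that a metric anomaly detector could then model.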

IT operations teams can use the alerts that are generated to identify the log patterns that deviated from normal behavior and the time period in which the abnormalities were observed. Teams can use this information for root cause analysis to help resolve incidents.

Note: If the Log anomaly detection - golden signals algorithm is enabled, the Log anomaly detection - natural language or Log anomaly detection - statistical baseline algorithms cannot be active. Likewise, if Log anomaly detection - natural language or Log anomaly detection - statistical baseline is enabled, Log anomaly detection - golden signals cannot be active.

What is a golden signal?

Golden signals help IT operations teams to classify and organize events when they diagnose incidents.

  • Seven types of golden signals exist within IBM Cloud Pak® for AIOps.
  • Golden signal types can fall into the following classes:
    • Causes include latency, error, and availability.
    • Effects include exception, traffic, and saturation.
    • The information type is not assigned to any class and is shown with a class of none.

Golden signal types

  • Latency: These events include a duration and are not related to resources.

    • Examples include transaction load, transfer, or any other metrics where the unit of measurement is time.
    • Example events include disk read/write times and I/O wait times.
  • Error: These events are associated with a timeout, drop, failure, reset, retry, or retransmission of a request, packet, or notification.

    • An error can also be understood as a limiting condition of latency. Examples include HTTP code 5xx, failed requests, the rate or count of users affected by crashes, error counts or rates for requests or user actions, and event failures.
  • Availability: These events are indicative of the availability or readiness of the resource.

    • Examples include HTTP code 4xx, host unhealthy, resource availability, readiness, and health status.
  • Exception: These events are composed of stack trace or any indication of other errors in the log data.

    • Examples include DSRA0010E : SQL State = 08S01, Error Code = 18.456: (FailedLoginException) Login error: com.ibm.security.krb5.KrbException, and status code: 6 message: Client not found in Kerberos database.
  • Traffic: These events represent the volume or rate of transactions that flow through the system.

    • Examples include HTTP code 2xx, traffic in/out, bytes received/sent/outstanding, disk/bytes read/write operations, throughput, service idle time, and HTTP 2xx/4xx/5xx success counts.
    • Events that represent idle time might also be classified as traffic because they might indicate that a service is experiencing idle time, indicating low traffic.
  • Saturation: These events are resource-oriented and are measured as a count or percentage of a resource that can be exhausted or throttled. Such resources include CPU, GPU, memory, swap, kernel thread, cache, containers, link, connection, process, queue, disk, and session.

    • At the infrastructure level, such resources might be as granular as CPU, GPU, memory, and disk space or as broad as a Kubernetes cluster, pod, and VM.
    • At the middleware level, such resources might include connection pool and thread pool.
    • At the application level, such resources might be user sessions.
  • Information: Events that do not fall into any of the previous golden signal types are assigned as information logs, for example, <<UNKNOWN TYPE>> on Success type = 0 data=true.
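
To make the type descriptions more concrete, the following sketch assigns a golden signal type with simple keyword rules. It is only an illustration: the keyword lists and the rule order are assumptions chosen for this example, and the product's actual classification logic is not described in this topic.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("exception", re.compile(r"exception|traceback", re.I)),
    ("error", re.compile(r"timeout|timed out|dropped|fail|reset|retry|retransmi", re.I)),
    ("availability", re.compile(r"unhealthy|not ready|readiness|unavailable", re.I)),
    ("saturation", re.compile(r"\b(?:cpu|memory|swap|queue|connection pool|thread pool)\b", re.I)),
    ("latency", re.compile(r"\b\d+(?:\.\d+)?\s?(?:ms|sec|seconds)\b|latency|duration", re.I)),
    ("traffic", re.compile(r"bytes (?:received|sent)|throughput", re.I)),
]

def classify(template: str) -> str:
    """Return the first golden signal type whose rule matches, else 'information'."""
    for signal, pattern in RULES:
        if pattern.search(template):
            return signal
    return "information"
```

With these assumed rules, `classify("host unhealthy")` returns `availability`, and the sample `<<UNKNOWN TYPE>> on Success type = 0 data=true` falls through to `information`.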

Enabling or disabling the log anomaly golden signals pipeline

To enable the pipeline, complete the following steps:

  1. Before you enable the log anomaly golden signals pipeline, disable the Log anomaly detection - natural language and the Log anomaly detection - statistical baseline algorithms from the AI Model Management UI.
    Figure. AI Model Management UI page
  2. Enable the metric anomaly detection training schedule from the AI Model Management UI.
  3. Click Set up training and complete the steps in Starting the training setup to enable the pipeline.
  4. Verify that the Log anomaly detection - golden signals policy is enabled so the log anomaly golden signals pipeline can run as expected. The policy is enabled by default. After the pipeline runs, it produces an alert.

To disable the pipeline, complete the following steps:

  1. Click the Log anomaly detection - golden signals tile within the AI Model Management page and delete the training definition.
    Figure. Log anomaly detection - golden signals UI page
  2. Enable the Log anomaly detection - natural language or the Log anomaly detection - statistical baseline algorithms from the AI Model Management UI. Now the pipeline is configured not to use golden signals, and all incoming log lines are instead sent to the log anomaly detection pipeline.

How to set up a training definition for log anomaly detection - golden signals

For more information, see Setting up training for log anomaly detection - golden signals.

Handling large log message loads for template training

Starting with IBM Cloud Pak® for AIOps version 4.7.1, log message sampling is enabled by default. The training initially runs every 10 minutes until at least 100,000 log messages are processed. Then, the training runs once every hour. Regardless of the log message load, 450,000 logs are sampled by default for every training run.
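
The default schedule can be summarized in a short sketch. The numbers come from the paragraph above. The function names are hypothetical, and the assumption that a batch smaller than the sample size is used in full is made only for this example.

```python
INITIAL_INTERVAL_MIN = 10   # training runs every 10 minutes at first
STEADY_INTERVAL_MIN = 60    # then once every hour
WARMUP_MESSAGES = 100_000   # switch point stated in the text
SAMPLE_SIZE = 450_000       # default number of logs sampled per training run

def next_interval_minutes(processed_so_far: int) -> int:
    """Minutes until the next training run, given the messages processed so far."""
    if processed_so_far < WARMUP_MESSAGES:
        return INITIAL_INTERVAL_MIN
    return STEADY_INTERVAL_MIN

def sample_count(available: int) -> int:
    """Logs used by one run (assumes smaller batches are used in full)."""
    return min(available, SAMPLE_SIZE)
```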

If the log message load contains more than 10 million logs each hour, tune some parameters to override the defaults with one of the following options.

Option 1

Increase the pod resources and the MINIMUM_SAMPLE_SIZE_PER_TRAINING_RUN and MAXIMUM_SAMPLE_SIZE_PER_TRAINING_RUN values.

  1. After you log in to the terminal with the oc login command, edit the deployment file by entering the following command:

    oc edit deployment aimanager-aio-log-anomaly-golden-signals
    
  2. Modify the values in the resources section of the container to the following values:

    containers:
      - resources:
          limits:
            cpu: '3'
            memory: 6Gi
          requests:
            cpu: '2'
            memory: 4Gi
    
  3. In the env section, add or modify the following values:

    - name: MINIMUM_SAMPLE_SIZE_PER_TRAINING_RUN
      value: '800000'
    - name: MAXIMUM_SAMPLE_SIZE_PER_TRAINING_RUN
      value: '900000'
    

Option 2

Alternatively, run the training more frequently even after the initial 100,000 log messages are processed by modifying the JOB_FREQUENCY_TIME value.

  1. After you log in to the terminal with the oc login command, edit the deployment file by entering the following command:

    oc edit deployment aimanager-aio-log-anomaly-golden-signals
    
  2. In the env section, add or modify the following value to change the number of minutes:

    - name: JOB_FREQUENCY_TIME
      value: '10'
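
A rough calculation can help when choosing between the two options. The sketch below estimates what fraction of an hourly log load is sampled. It assumes that each run samples independently from the logs that arrived since the previous run and that Option 1 samples at its maximum value; both are simplifications for illustration, and `hourly_coverage` is a hypothetical helper.

```python
def hourly_coverage(logs_per_hour, sample_per_run, run_interval_min):
    """Fraction of one hour of logs that training sees (simplified model)."""
    runs_per_hour = 60 / run_interval_min
    sampled = min(logs_per_hour, sample_per_run * runs_per_hour)
    return sampled / logs_per_hour

load = 10_000_000                                   # logs each hour
default = hourly_coverage(load, 450_000, 60)        # default settings
option_1 = hourly_coverage(load, 900_000, 60)       # larger sample per run
option_2 = hourly_coverage(load, 450_000, 10)       # run every 10 minutes
```

Under these assumptions, the default settings cover about 4.5% of a 10 million log load, Option 1 about 9%, and Option 2 about 27%.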
    

Viewing templates

Now the training setup is complete. The template training continuously runs in the background.

After the data integration collects the threshold number of log messages, the algorithm triggers template training on the data that is collected. The algorithm starts to process the log messages to categorize them into the required templates.

When template training is complete, a summary count of all generated templates is displayed, as shown in the following example:

Figure. Summary count of all templates

Templates with a golden signal type of Information are categorized as disabled. Templates with any other golden signal type are categorized as enabled.
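
The default state rule is small enough to express in a few lines. This is a sketch of the rule as described here; `default_state` is a hypothetical name, not a product API.

```python
def default_state(golden_signal_type: str) -> str:
    """Information templates start disabled; every other type starts enabled."""
    if golden_signal_type.strip().lower() == "information":
        return "disabled"
    return "enabled"
```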

You can view more details about these templates by clicking the Templates tab.

Figure. View templates

The templates are displayed in a tabular format with the following details:

  • The Template name section shows the log pattern that matches the input log messages.
    • Logs that do not match any existing template are assigned to a template whose name includes unmatched. The specific unmatched template is chosen based on the golden signal type that is classified from the log content.
    • Click a Template name to view template Details.
  • The ID numbers indicate the unique template ID value that is assigned by the training algorithm.
  • The Alerts last 24 hours and Alerts last 2 weeks sections display the number of alerts that were generated for that particular template in the last 24 hours or 2 weeks. An alert is generated for a template when its log message patterns or frequencies change unexpectedly.
  • The Log messages last 24 hours and Log messages last 2 weeks sections show the count of any log messages that were collected in the last 24 hours or 2 weeks that match a specific template pattern.
  • The Type section indicates whether the template is Custom, Generated, or Default.
    • You can create a Custom template with Add template.
    • A Generated template is created from the training algorithm.
    • All unmatched templates are categorized as Default.
  • The Golden signal type section shows the golden signal type that is attached to a template. Open the menu for the template that you want to modify. Click Set golden signal to select a golden signal type, or click Set state to set the state.
  • The State section shows enabled or disabled based on the golden signal type for the specific template. Enabled templates are used to track log message patterns over extended periods of time. This data is then sent to metric anomaly detection. Alerts are generated when log message patterns or their frequencies change unexpectedly.

If the counts against template patterns from the log anomaly detection - golden signals algorithm are not updated in the training UI table, enable historic alert storing in Elasticsearch to access alert counts. For more information, see Counts against template patterns are not updated in the training UI.

Viewing unmatched templates

Log messages that don't have any common pattern to them are grouped based on their golden signal type and matched to a template pattern with an unmatched_<golden_signal_type> format.

After the number of unmatched log messages reaches the threshold count, an incremental training run is triggered. The incremental training might generate new log template patterns that match log lines similar to the current unmatched logs, and it assigns those logs to the new templates.
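
The grouping and the threshold trigger can be sketched as follows. The threshold value is hypothetical because this topic does not state it, and the sketch assumes that the threshold applies to the total number of unmatched messages rather than to each group.

```python
from collections import defaultdict

INCREMENTAL_THRESHOLD = 1000  # hypothetical value; the real threshold is not stated

class UnmatchedBuckets:
    """Groups unmatched log messages by golden signal type."""

    def __init__(self, threshold=INCREMENTAL_THRESHOLD):
        self.threshold = threshold
        self.buckets = defaultdict(list)

    def total(self):
        return sum(len(messages) for messages in self.buckets.values())

    def add(self, message, golden_signal_type):
        """File the message under unmatched_<type>; return True when retraining is due."""
        name = f"unmatched_{golden_signal_type.lower()}"
        self.buckets[name].append(message)
        return self.total() >= self.threshold
```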

Click a Template name of an unmatched template to view template Details.

Viewing log messages

View up to 10,000 log messages that match a template pattern.

  1. Click a Template name. Then, click the Log messages tab.
    • Received shows timestamps for when the messages were received.
    • Log message shows the log message content that matches the template pattern.
  2. Click Export to CSV file to generate a CSV file if you want to further analyze the data.
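
As one way to analyze the exported data, the following sketch counts log messages per hour. The column names `Received` and `Log message` mirror the tab labels, but the actual CSV header and timestamp format are assumptions here; adjust them to match the file that you export.

```python
import csv
import io
from collections import Counter
from datetime import datetime

def messages_per_hour(csv_text, time_column="Received",
                      time_format="%Y-%m-%d %H:%M:%S"):
    """Count exported log messages per hour (column name and format are assumed)."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        received = datetime.strptime(row[time_column], time_format)
        counts[received.strftime("%Y-%m-%d %H:00")] += 1
    return counts

# Made-up sample data that stands in for an exported file.
sample = (
    "Received,Log message\n"
    "2024-01-01 10:01:00,Request failed after 30 ms\n"
    "2024-01-01 10:40:00,Request failed after 45 ms\n"
    "2024-01-01 11:05:00,Request failed after 12 ms\n"
)
per_hour = messages_per_hour(sample)
```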

Viewing metric anomalies in the Alert Viewer

For more information about viewing Log anomaly detection - golden signals metric anomalies in the Alert Viewer, see Viewing metric anomaly details.

Adding templates

If the automatically generated templates didn't capture a pattern that is important to you, click Add template to add your own custom template patterns.

  1. From the Templates tab, click Add template. The New custom template page opens.
  2. Enter an optional Template name.
  3. Enter a Sample log message for the pattern that needs to be generated.
  4. Click Generate pattern. Based on prior knowledge and predefined variable rules, the model generates a Log message pattern for the sample log message. Along with the pattern, a default Golden signal type is assigned based on the log message that you entered. The State is Enabled by default. However, you can modify these values.
  5. To edit the generated pattern, click Edit. Use the menus to customize the log message pattern. For example, to change a variable type, click an Unknown Type menu and select the option that you want.
  6. Click Done.
  7. If the assigned category isn't correct, select a different Golden signal type.
  8. Select the State as needed. If the template is Disabled, models aren't trained with it.
  9. When your template is ready for use, click Create template.
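
To get a feel for how a log message pattern relates to log messages, the sketch below turns a pattern into a regular expression and checks whether a message matches it. It assumes that variables appear as `<<...>>` placeholders, as in the `<<UNKNOWN TYPE>>` example earlier in this topic, and that each variable stands for a single token without whitespace. The product's real matching rules might differ.

```python
import re

PLACEHOLDER = re.compile(r"<<[^<>]+>>")

def pattern_to_regex(template: str) -> re.Pattern:
    """Escape the static text and let each placeholder match one token (assumption)."""
    static_parts = PLACEHOLDER.split(template)
    return re.compile(r"\S+".join(re.escape(part) for part in static_parts))

def matches(template: str, message: str) -> bool:
    """True when the whole message fits the template pattern."""
    return pattern_to_regex(template).fullmatch(message) is not None
```

For example, under these assumptions `matches("<<UNKNOWN TYPE>> on Success type = 0 data=true", "worker-7 on Success type = 0 data=true")` is true, while a message with different static text does not match.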

Editing templates

Edit templates from the Templates tab or the Details tab.

Edit a template from the Templates tab:

  1. Find the template that you want to edit in the table.
  2. Click Golden signal type. A menu opens.
  3. Select Set golden signal type to set the golden signal type of the template or select Set state to enable or disable the template.
  4. If the template Type is Custom, you can also edit the Template pattern.

Edit a template from the Details tab:

  1. In the table, find the template that you want to edit and click the Template name. The template Details tab opens.
  2. Update the Template name, select a Golden signal type, or change the State to Enabled or Disabled.
  3. If the template Type is Custom, you can also edit the Template pattern.
  4. After your changes are complete, click Save.

Deleting templates

You can delete any templates except Default templates.

  1. Find the template that you want to delete in the table.
  2. Click Template edit options (the three dots at the end of the row) and select Delete.