Log-based machine learning overview

The log-based machine learning system in IBM Z® Anomaly Analytics detects anomalies in z/OS® system log data.

Data flow

The following steps describe the data flow among the components of the log-based machine learning system in IBM Z Anomaly Analytics. The step numbers correspond to the numbers that are shown in Figure 1.
  1. Historical or streaming z/OS SYSLOG data is collected by the Z Common Data Provider and forwarded to Apache Kafka.
  2. The log-based machine learning system subscribes to the Apache Kafka SYSLOG topic and parses the SYSLOG data.
  3. Incoming log data is summarized, and the summaries are stored in a relational database.
  4. If a log data model is available, the summarized data is compared to the model to identify abnormal behavior.
  5. The score results are written to the Apache Kafka message broker and are stored in a NoSQL long-term data store.
  6. The rules engine polls Apache Kafka for anomalous events that exceed a configured threshold. Highly anomalous events are forwarded to the Apache Kafka Ensemble-Event topic.
  7. Ensemble subscribes to the Ensemble-Event topic, and it groups these anomalous events, together with anomalous events from the metric-based machine learning system, into event groups by resource (system or subsystem). Ensemble calculates the severity and confidence score of each event group. The results are stored in a NoSQL long-term data store, and they are forwarded to the Apache Kafka Ensemble-EventGroupNotification topic.
  8. When a user accesses the ensemble GUI, the score results are retrieved from the NoSQL data store and rendered in the GUI.
Figure 1. Log-based machine learning overview
The illustration shows the flow of data among the primary components, as described in the text.
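
Steps 6 and 7 of the data flow can be pictured with a short Python sketch. It is illustrative only: the ScoredEvent class, its field names, the score scale, and the threshold value are assumptions made for this example and are not the product's schema or API. The sketch shows only the threshold filter and the grouping by resource. It does not attempt the severity and confidence calculation, which is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredEvent:
    """Hypothetical shape of one scored event (not the product's schema)."""
    resource: str  # system or subsystem name
    source: str    # "log" or "metric"
    score: float   # anomaly score; the scale is assumed for illustration


def select_anomalous(events, threshold):
    """Step 6: keep only the events whose score exceeds the threshold."""
    return [event for event in events if event.score > threshold]


def group_by_resource(events):
    """Step 7: collect the forwarded events into one group per resource."""
    groups = defaultdict(list)
    for event in events:
        groups[event.resource].append(event)
    return dict(groups)


# Illustrative data only.
events = [
    ScoredEvent("SYS1", "log", 99.0),
    ScoredEvent("SYS1", "metric", 95.5),
    ScoredEvent("SYS2", "log", 42.0),
]

forwarded = select_anomalous(events, threshold=90.0)
event_groups = group_by_resource(forwarded)
for resource, members in event_groups.items():
    print(resource, [m.source for m in members])
```

In this sketch, the SYS2 event falls below the assumed threshold, so only SYS1 produces an event group, and that group combines a log-based event with a metric-based event.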

Ensemble GUI for log anomaly detection

In the ensemble GUI, when you click the Log-Based Anomalies page, the following high-level pages are available from the menu:
Analysis
The Analysis page is the home page that is shown in the content pane when you first click the Log-Based Anomalies page. On this page, you select the date and the systems that you want to monitor for anomalies. For each hour of the selected day, the highest anomaly score is shown for each monitored system. Click a score to view a bar graph of the associated 1-hour period. In the resulting graph, each column represents a 10-minute interval within the hour. Click a column in the graph to view more information about the interval.
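
As a rough illustration of how the hourly scores relate to the 10-minute intervals, the following Python sketch takes the highest interval score for each system and hour. The function name, the input tuple layout, and the sample scores are assumptions made for this example. The sketch does not reflect how the product stores or computes scores.

```python
from datetime import datetime


def hourly_max_scores(interval_scores):
    """Map each (system, hour) to the highest score of its 10-minute intervals.

    interval_scores: iterable of (system, interval_start, score) tuples,
    where interval_start is a datetime within the selected day.
    """
    best = {}
    for system, start, score in interval_scores:
        key = (system, start.hour)
        if key not in best or score > best[key]:
            best[key] = score
    return best


# Illustrative data only: three intervals in hour 14, one in hour 15.
sample = [
    ("SYS1", datetime(2023, 1, 5, 14, 0), 12.0),
    ("SYS1", datetime(2023, 1, 5, 14, 10), 87.5),
    ("SYS1", datetime(2023, 1, 5, 14, 50), 33.0),
    ("SYS1", datetime(2023, 1, 5, 15, 0), 5.0),
]

print(hourly_max_scores(sample))
```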

For log-based anomalies, you can selectively ignore the z/OS messages of your choice in the analysis of future intervals. This function can be helpful, for example, if irrelevant messages are frequently shown in the analysis intervals, and you want to reduce the clutter so that you can more easily notice significant issues in the subsystem. To access this function, click Actions > Ignore Message from the Analysis Details page. If you choose to ignore a message, you can also choose when to remove it from the ignore list. For example, you can ignore the message until the next training occurs, or you can ignore it until it is manually selected to be restored.

Message History
The Message History page shows all occurrences of a message across all monitored systems.
Notifications
The Notifications page lists messages from the components of the log-based machine learning system that report activity that you must be aware of or respond to. For example, the messages report whether the training phase succeeded or failed.
System Status
The System Status page shows status information for the monitored systems, including the status of the data pipeline. The status of the data pipeline indicates the overall health of the components that are involved in processing the log data through the log-based machine learning system.
Administration > Training Management
The Training Management page lists the training status values and other details for z/OS monitored systems. It includes an Actions menu. To manage the ignore status of messages, to request training, or to manage the training dates for a system, select the checkbox for the system, and from the Actions menu, select the relevant action.
From the Training Management page, administrators can perform the following actions to manage the training for monitored systems:
  • Selectively ignore, or reinstate, the z/OS messages of your choice in the analysis in future intervals.
  • Request training, and view the status of a training request.
  • View the training dates for the current model and for the model in the next training period. For example, you can view the following dates:
    Current model begin date
    The date on which the building of the current model began.
    Current model trained date
    The date on which the training of the current model was completed.
    Next training period begin date
    The earliest date for which data is included when the next model is automatically built. The next model will include data from between midnight on this begin date and 23:59:59 on the day before the next model begin date.

    For example, if this begin date is 26 December 2022, and the training period is 90 days, the next model will include data (if data is available) from between midnight on 26 December 2022 and 23:59:59 on 25 March 2023 (which is the day before the Next model begin date of 26 March 2023).

    Next model begin date
    The date on which the automatic building of the next model begins.
  • Select dates to include or exclude from future training. For example, you might want to exclude specific dates in a future training due to very abnormal behavior that occurred on those dates. However, you might want to include those dates again later for another future training.
Restriction: The training management capability is available only for messages that are issued by z/OS monitored systems.
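
The date arithmetic in the 26 December 2022 example can be checked with a few lines of Python. This is only a sketch of the calendar calculation. The function name training_window is hypothetical and is not part of the product.

```python
from datetime import date, timedelta


def training_window(period_begin, period_days):
    """Return (last_included_day, next_model_begin) for a training period.

    The window covers period_days consecutive calendar days, starting at
    midnight on period_begin and ending at 23:59:59 on last_included_day.
    """
    next_model_begin = period_begin + timedelta(days=period_days)
    last_included_day = next_model_begin - timedelta(days=1)
    return last_included_day, next_model_begin


last_day, next_begin = training_window(date(2022, 12, 26), 90)
print(last_day)    # 2023-03-25
print(next_begin)  # 2023-03-26
```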
Administration > Configuration
From the Configuration page, you can change the default training period and training interval for a z/OS monitored system. The training period indicates how many days of data to include in a model. The training interval indicates how often to automatically recreate the model.
training period
The number of consecutive calendar days that the log-based machine learning system uses to identify the monitored system data to include in training models. The default value is 90 days.

The log-based machine learning system expects 90 days of historical log data to create a reasonable model. For some workload patterns, 30 days of historical log data might be enough to build a model, but the resulting model might be suboptimal.

training interval
The number of consecutive calendar days between automatic builds of system behavior models. The default value is 30 days.
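
To see how the two settings interact, the following Python sketch lists the data window for a few successive automatic builds. It assumes that each build uses the training period that ends on the day before the build date, and that builds repeat exactly every training interval. Both points are simplifying assumptions for illustration, and the function name build_schedule is hypothetical.

```python
from datetime import date, timedelta


def build_schedule(first_build, period_days=90, interval_days=30, builds=3):
    """List (build_date, window_begin, window_end) for successive builds.

    Assumes each build covers the period_days calendar days that end on
    the day before build_date (an assumption made for this sketch).
    """
    schedule = []
    for n in range(builds):
        build_date = first_build + timedelta(days=n * interval_days)
        window_begin = build_date - timedelta(days=period_days)
        window_end = build_date - timedelta(days=1)
        schedule.append((build_date, window_begin, window_end))
    return schedule


for build_date, begin, end in build_schedule(date(2023, 3, 26)):
    print(f"build {build_date}: data from {begin} through {end}")
```

With the default values, consecutive windows overlap by 60 days, so each new model replaces the oldest 30 days of data with the newest 30 days.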

Variable analysis

If you enable variable analysis, the log-based machine learning system extracts the variables from each z/OS SYSLOG message and analyzes the variable values to determine how rare the values are. The rarity of a value is assessed in comparison to the values of the same variable in all the messages that have the same message ID and message text.

Variable analysis uses historical patterns to determine the rarity of the variable values in a message. The historical patterns are analyzed and stored in a model. By default, the log-based machine learning system waits to receive at least 1 day of z/OS SYSLOG messages before it builds an initial model for variable analysis. A new model is then built once a day by using a maximum of 15 days of z/OS SYSLOG messages. When a model is available for a z/OS system, the z/OS SYSLOG messages from that z/OS system are analyzed for rarity.

The log-based machine learning system then assigns a variable rarity score to the message. In the ensemble GUI, this score is shown for each message in the analysis results for log-based anomalies.

The rarity score ranges from 0 to 1, where 0 means that the variable value is not rare, and 1 means that the variable value is the rarest. If a rarity score cannot be assigned, it is shown as Not Available. A rarity score cannot be assigned in situations such as the following:
  • The message has no variables.
  • The message is not included in the analysis model.
  • Not enough data is available to calculate the variable rarity score.
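
The following Python sketch illustrates the idea of a 0 to 1 rarity score. It is not the product's algorithm, which is not described here. It assumes a simple frequency-based measure, a hypothetical min_samples cutoff for "not enough data", and returns None to stand for Not Available.

```python
from collections import Counter


def rarity_score(history, value, min_samples=10):
    """Illustrative rarity in [0, 1] (assumed formula, not the product's).

    history: past values of one variable, taken from messages that share
    the same message ID and message text.
    Returns 0.0 for the most common value, 1.0 for a value never seen,
    and None ("Not Available") when no score can be assigned.
    """
    if value is None or len(history) < min_samples:
        return None
    counts = Counter(history)
    most_common_count = counts.most_common(1)[0][1]
    return 1.0 - counts[value] / most_common_count


# Illustrative data only: JOB1 appears 8 times, JOB2 appears 2 times.
history = ["JOB1"] * 8 + ["JOB2"] * 2

print(rarity_score(history, "JOB1"))      # 0.0
print(rarity_score(history, "JOB2"))      # 0.75
print(rarity_score(history, "JOB9"))      # 1.0
print(rarity_score(history[:3], "JOB1"))  # None
```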

To enable variable analysis in z/OS SYSLOG messages, and to further refine the configuration for variable analysis, review and update the configuration file ZOA_HOME/zoa_env.config, as described in Post-installation configuration of the software containers.