Workflow of statistical baseline log anomaly detection
Find out how statistical baseline log anomaly detection processes your log data to generate log alerts.
This algorithm extracts specific entities from the logs, such as short text strings that indicate error codes and exceptions. It combines these entity counts with other statistics, such as the number of logs for each component, and uses them as a baseline to detect abnormal behavior in your live log data.
These extracted entities are not just a simple dictionary: the algorithm also uses clue words to interpret other data in the logs, such as numbers. For example, the number 500 in a log message might be a count, an identifier, or some other value, but if it is accompanied by clue words, such as http status, the system knows that the number refers to an error code. When enough of these deviations are detected to be statistically significant, the system reports an anomaly.
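To make the clue-word idea concrete, the following Python sketch shows one way such extraction could work. The clue words, the extract_entities function, and the regular expression are illustrative assumptions for this example; the product's actual extraction logic is internal and more sophisticated.

```python
import re

# Illustrative clue words mapping a nearby number to a meaning. These are
# assumptions for the example; the product's actual clue-word list is internal.
CLUE_WORDS = {
    "http status": "error_code",
    "status code": "error_code",
}

def extract_entities(message: str) -> list[tuple[str, str]]:
    """Extract (entity_type, value) pairs from a raw log message."""
    entities = []
    lowered = message.lower()
    for clue, entity_type in CLUE_WORDS.items():
        # A number within a few characters after the clue word inherits
        # the clue word's meaning.
        for match in re.finditer(re.escape(clue) + r"\D{0,5}(\d+)", lowered):
            entities.append((entity_type, match.group(1)))
    return entities

print(extract_entities("GET /api/v1/items failed, http status: 500"))
# [('error_code', '500')]
```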
Domain-specific log anomaly detection, such as that for IBM MQ or WebSphere, uses the same algorithm as statistical baseline log anomaly detection. However, the entities that domain-specific log anomaly detection looks for in the input logs are already provided from prior knowledge about the specific domain. After entities such as the message ID and log level are detected, the workflow of a domain-specific log anomaly is similar to the workflow of a general statistical baseline log anomaly.
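As an illustration of what entities from prior knowledge might look like, the sketch below hard-codes a few regular expressions in the style of IBM MQ message IDs (for example, AMQ9202E) and WebSphere message IDs (for example, WSVR0605W). These patterns and the extract_domain_entities helper are assumptions for demonstration, not the product's actual dictionaries.

```python
import re

# Illustrative prior-knowledge patterns in the style of IBM MQ and WebSphere
# message IDs. These regular expressions are assumptions, not the product's
# actual domain dictionaries.
DOMAIN_PATTERNS = {
    "mq_message_id": re.compile(r"\bAMQ\d{4}[EIWS]?\b"),
    "websphere_message_id": re.compile(r"\bWSVR\d{4}[EIW]\b"),
    "log_level": re.compile(r"\b(ERROR|WARNING|INFO|DEBUG)\b"),
}

def extract_domain_entities(message: str) -> dict[str, list[str]]:
    """Return every prior-knowledge entity found in a log message."""
    found = {}
    for name, pattern in DOMAIN_PATTERNS.items():
        hits = pattern.findall(message)
        if hits:
            found[name] = hits
    return found

print(extract_domain_entities("AMQ9202E: ERROR connecting to remote host"))
# {'mq_message_id': ['AMQ9202E'], 'log_level': ['ERROR']}
```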
The following figure shows a simplified workflow of the statistical baseline log anomaly detection algorithm.
- Algorithm switched on
The algorithm is first switched on. Note: See Supported resource number and throughput rates for limits on throughput rates.
- Extract initial reference values
At the end of the first 30-minute time interval, the algorithm reviews the log messages from the previous 30 minutes and extracts text entities from the logs. It saves the number of occurrences of each entity as reference values.
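A minimal sketch of this counting step, assuming the hypothetical extract_entities helper from the earlier example and a list of the log messages that arrived during the first 30-minute interval:

```python
from collections import Counter

def build_reference(messages: list[str]) -> Counter:
    """Count how often each extracted entity occurs in a 30-minute interval.

    Assumes the hypothetical extract_entities() helper from the earlier
    example; the result becomes the initial set of reference values.
    """
    reference = Counter()
    for message in messages:
        for entity in extract_entities(message):
            reference[entity] += 1
    return reference
```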
- Review live stream for anomalies
Now that it has the reference values, the system starts monitoring the live log stream for anomalies, which are statistically significant deviations from the baseline. It evaluates each component’s logs every 10 seconds.
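The documentation does not specify which statistical test is used, so the following sketch substitutes a simple ratio test: an entity count in a 10-second window is flagged when it deviates from the baseline rate by more than a fixed factor. The DEVIATION_FACTOR value and the is_anomalous helper are hypothetical.

```python
# Hypothetical ratio test standing in for the undocumented statistical test.
DEVIATION_FACTOR = 3.0

def is_anomalous(observed: int, expected_per_window: float) -> bool:
    """Flag a count in a 10-second window that deviates strongly from baseline.

    expected_per_window is the reference count scaled down to one
    10-second evaluation window.
    """
    if expected_per_window == 0:
        return observed > 0  # entity detected but never seen in the baseline
    ratio = observed / expected_per_window
    return ratio > DEVIATION_FACTOR or ratio < 1 / DEVIATION_FACTOR
```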
- Extract values from next 30-minute interval and update reference values
At the end of the next 30-minute time interval, the algorithm again reviews the log messages from the previous 30 minutes. As before, the algorithm extracts text entities and statistics from the logs. It then combines this data with the previous baseline to compute a new baseline, which now covers 60 minutes' worth of normal behavior data.
Note: If the live data that is being processed is not from the last 30 minutes, the algorithm still looks for anomalies, but the reference values are not updated. The last 30-minute time interval might contain older data when, for example, a delay in data processing occurs due to a temporary malfunction of a data processing component, such as a Flink job.
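A sketch of the update rule under two assumptions: that the baseline is a plain cumulative count (the product's actual weighting between old and new intervals is not documented), and that delayed data is detected upstream and passed in as the data_is_current flag.

```python
from collections import Counter

def update_reference(reference: Counter, interval_counts: Counter,
                     data_is_current: bool) -> Counter:
    """Fold a new 30-minute interval into the running reference values."""
    if not data_is_current:
        # Delayed data (for example, after a temporary Flink job malfunction):
        # anomaly detection continues, but the reference is left unchanged.
        return reference
    # Assumption: the baseline is a plain cumulative count; the product's
    # actual weighting between old and new intervals is not documented.
    return reference + interval_counts
```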
- Log alerts are generated
The system continuously monitors the live log stream and evaluates each component’s logs every 10 seconds. At the end of each 10-second interval, a log anomaly is generated for each log line in which a statistically significant difference exists between the reference values and the identified entities, the identified numbers that are accompanied by clue words, or both.
Based on this analysis, the log anomaly detection algorithm identifies one or more of the following log anomaly types:
- An entity is detected but not expected and includes an error.
- An entity is detected with lower frequency than expected.
- An entity is detected with higher frequency than expected.
The algorithm creates a log alert for each log anomaly and sends these alerts to the in-cluster data store. Any alerts that are generated by this algorithm are assigned severity 5 (Major).
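The three anomaly types map naturally onto a comparison between observed and expected counts. The following sketch shows that mapping; the statistical significance test itself is omitted (see the earlier is_anomalous sketch), and is_error is a hypothetical flag marking entities that carry an error indication.

```python
from enum import Enum
from typing import Optional

class AnomalyType(Enum):
    UNEXPECTED_ENTITY = "entity detected but not expected, includes an error"
    LOWER_FREQUENCY = "entity detected with lower frequency than expected"
    HIGHER_FREQUENCY = "entity detected with higher frequency than expected"

def classify(observed: int, expected: float,
             is_error: bool) -> Optional[AnomalyType]:
    """Map an observed entity count to one of the documented anomaly types."""
    if expected == 0 and observed > 0 and is_error:
        return AnomalyType.UNEXPECTED_ENTITY
    if observed < expected:
        return AnomalyType.LOWER_FREQUENCY
    if observed > expected:
        return AnomalyType.HIGHER_FREQUENCY
    return None  # no difference, so no log alert is created
```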
- Interval is potentially tagged as biased
Based on the number of log anomalies detected in the previous step, a threshold calculation is performed. The number of log anomalies that are detected is divided by the number of windows. If the resulting value exceeds the predefined threshold of 0.15, then the entire interval is tagged as biased, and none of the reference values are updated based on the entity extraction values for this interval.
Note: The predefined threshold ensures that frequently occurring anomalies do not bias the statistical baseline reference. If the anomalies are not frequently occurring, such as when the resulting value does not exceed the predefined threshold of 0.15, the interval is not tagged as biased, and the reference values are updated. It is possible that such infrequent anomalies are learned as normal behavior, and they might not be detected again. To reset the learning, you need to manually reset the reference of the statistical baseline. For more information, see Resetting the reference of the statistical baseline.
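The threshold calculation itself is simple arithmetic, sketched below. One assumption: the windows in the division are taken to be the 10-second evaluation windows, which the documentation does not state explicitly.

```python
BIAS_THRESHOLD = 0.15  # predefined threshold from the documentation

def interval_is_biased(anomaly_count: int, window_count: int) -> bool:
    """Decide whether a 30-minute interval should be tagged as biased.

    Assumption: window_count is the number of 10-second evaluation windows
    in the interval, which the documentation does not state explicitly.
    """
    return anomaly_count / window_count > BIAS_THRESHOLD
```

Under that assumption, a 30-minute interval contains 180 windows, so more than 27 detected anomalies (27 / 180 = 0.15) would tag the interval as biased.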
- Processing of the live log stream continues
The algorithm continues processing 30-minute time intervals in the live log stream.