Workflow of statistical baseline log anomaly detection

Find out how statistical baseline log anomaly detection processes your log data to generate log alerts.

This algorithm extracts specific entities from the logs, such as short text strings that indicate error codes and exceptions, and combines them with other statistics, such as the number of logs for each component, to form a baseline against which abnormal behavior in your live log data is detected.

These extracted entities are not just a simple dictionary: the algorithm uses clue words to interpret other data in the logs, such as numbers. For example, the number 500 in a log message might be a count, an identifier, or some other value, but if it is accompanied by clue words, such as HTTP status, then the system knows that it refers to an error code. When enough deviations are detected to be statistically significant, the system knows that there is an anomaly to report.
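
The following is a minimal sketch of the clue-word idea. It is illustrative only: the clue words, the entity format, and the function name are assumptions, not the product's actual implementation.

```python
import re

# Hypothetical clue words that mark a nearby number as meaningful.
CLUE_WORDS = ["http status", "status code", "error code"]

def extract_number_entities(log_line: str) -> list[str]:
    """Return numbers that a clue word identifies as error codes."""
    entities = []
    lowered = log_line.lower()
    for clue in CLUE_WORDS:
        # Capture a number that closely follows the clue word.
        for match in re.finditer(re.escape(clue) + r"\D{0,3}(\d+)", lowered):
            entities.append(f"{clue}:{match.group(1)}")
    return entities

# "500" alone is ambiguous; next to "http status" it becomes an error code.
print(extract_number_entities("request failed with http status 500"))  # ['http status:500']
print(extract_number_entities("processed 500 records"))                # []
```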

The following figure shows a simplified workflow of the statistical baseline log anomaly detection algorithm.

Figure: Workflow of the statistical baseline log anomaly detection algorithm

  1. Algorithm switched on

    The algorithm is first switched on. Note: See Supported throughput rates for limits on throughput rates.

  2. Extract initial reference values

    At the end of the first 30-minute time interval, the algorithm reviews the log messages from the previous 30 minutes and extracts text entities from the logs. It saves the number of occurrences of each entity as reference values.
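
    In outline, the reference values amount to occurrence counts per extracted entity. A hedged sketch (the data shapes and names are illustrative, not the product's actual implementation):

    ```python
    from collections import Counter

    def build_reference_values(window_logs: list[str]) -> Counter:
        """Count entity occurrences across one 30-minute window of logs."""
        reference = Counter()
        for line in window_logs:
            # extract_number_entities is the clue-word sketch shown earlier
            for entity in extract_number_entities(line):
                reference[entity] += 1
        return reference
    ```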

  3. Review live stream for anomalies

    Now that the reference values exist, the system starts monitoring the live log stream for anomalies, which are statistically significant deviations from the baseline. It evaluates each component’s logs every 10 seconds.
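
    The 10-second evaluation can be pictured as a rate comparison against the baseline. The product's actual significance test is not documented here, so the sketch below substitutes a simple Poisson-style deviation check; the scaling and threshold logic are assumptions:

    ```python
    def deviates_from_baseline(live_count: int, reference_count: int,
                               baseline_minutes: int = 30,
                               window_seconds: int = 10,
                               tolerance: float = 3.0) -> bool:
        """Flag counts that deviate significantly from the baseline rate."""
        # Scale the 30-minute reference count to a 10-second expectation.
        expected = reference_count * window_seconds / (baseline_minutes * 60.0)
        # Stand-in significance test: roughly a 3-sigma Poisson check.
        return abs(live_count - expected) > tolerance * max(expected, 1.0) ** 0.5
    ```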

  4. Extract values from the next 30-minute interval and update reference values

    At the end of the next 30-minute time interval, the algorithm again reviews the log messages from the previous 30 minutes and, as before, extracts text entities and statistics from the logs. It then combines these with the previous baseline to compute a new baseline, which now covers 60 minutes' worth of normal behavior data (see the sketch after the following note).

    Note: If the live data being processed is not from the last 30 minutes, the algorithm still looks for anomalies, but the reference values are not updated. One reason why the last 30-minute time interval might contain older data is a delay in data processing due to a temporary malfunction of a data processing component, such as a Flink job.
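
    The baseline update, together with the staleness guard from the note, can be sketched as follows (the Counter representation and the data_is_current flag are assumptions for illustration):

    ```python
    from collections import Counter

    def update_baseline(baseline: Counter, new_window: Counter,
                        data_is_current: bool) -> Counter:
        """Fold a completed 30-minute window into the running baseline."""
        # Per the note above: a window that contains older, replayed data
        # (for example, after a delayed Flink job) must not update the baseline.
        if not data_is_current:
            return baseline
        return baseline + new_window  # Counter addition merges the counts
    ```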

  5. Log alerts are generated

    The system continuously monitors the live log stream and evaluates each component’s logs every 10 seconds. Log anomalies are generated at the end of each 10-second interval for each log line in which there is a statistical difference between the number of entities identified and the reference values, in which numbers are identified and accompanied by clue words, or both.

    Based on this analysis, the log anomaly detection algorithm identifies one or more of the following log anomaly types (sketched after this list):

    • Entity detected but not expected, and includes an error
    • Entity detected with lower frequency than expected
    • Entity detected with higher frequency than expected

    The algorithm creates a log alert for each log anomaly and sends these alerts to the in-cluster data store. Any alerts generated by this algorithm are assigned severity 5 (Major).
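
    The three anomaly types reduce to comparisons between observed and expected entity counts. A simplified sketch (the function names and alert fields are illustrative; the severity value 5 matches the alert severity stated above):

    ```python
    def classify_anomaly(observed: int, expected: int,
                         includes_error: bool) -> str | None:
        """Map observed versus expected counts to one of the anomaly types."""
        if expected == 0 and observed > 0 and includes_error:
            return "detected but not expected, includes an error"
        if observed < expected:
            return "detected with lower frequency than expected"
        if observed > expected:
            return "detected with higher frequency than expected"
        return None  # no statistically relevant difference

    def make_log_alert(entity: str, anomaly_type: str) -> dict:
        """Build the alert that is sent to the in-cluster data store."""
        return {"entity": entity, "type": anomaly_type, "severity": 5}  # 5 = Major
    ```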

  6. Interval is potentially tagged as biased

    Based on the number of log anomalies detected in the previous step, a threshold calculation is performed. The number of log anomalies detected is divided by the number of windows, and if the resulting value exceeds the predefined threshold of 0.15, the entire interval is tagged as biased and none of the reference values are updated based on the entity extraction values for this interval (see the sketch below).
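
    The bias check is a single ratio test. A sketch (the 0.15 threshold comes from the text above; names are illustrative):

    ```python
    BIAS_THRESHOLD = 0.15  # predefined threshold from the algorithm

    def interval_is_biased(anomaly_count: int, window_count: int) -> bool:
        """Tag a 30-minute interval as biased when too many of its windows
        produced anomalies; a biased interval does not update the
        reference values."""
        return window_count > 0 and anomaly_count / window_count > BIAS_THRESHOLD
    ```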

  7. Processing of the live log stream continues

    The algorithm continues processing 30-minute time intervals in the live log stream.