Workflow of natural language log anomaly detection

Find out how natural language log anomaly detection processes your log data to generate log alerts.

The workflow of natural language log anomaly detection is made up of two parts: model generation, and application of the model to the live log data stream.

Model generation

The following figure shows a simplified workflow illustrating how the natural language log anomaly detection algorithm generates an AI model from log data. This part of the workflow is known as training.

Workflow of the natural language log anomaly detection algorithm: model generation

  1. Algorithm is configured and trained

    The algorithm configuration includes the specification of log data sources, the period of data to train on, whether to run training on a schedule or on demand, and whether to deploy the trained model automatically or manually. When training is launched, the model training process begins, as described in steps 2 to 6.
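
    For illustration only, the training configuration can be pictured as a small set of parameters. The field names and values in the following sketch are assumptions made for this example, not the product's actual configuration schema:

      # Illustrative sketch only: the field names below are assumptions,
      # not the actual configuration schema.
      training_config = {
          "log_data_sources": ["logdna-connection-1"],  # where training data is read from
          "training_period_days": 14,                   # period of data to train on
          "schedule": "on_demand",                      # or a recurring schedule
          "model_deployment": "manual",                 # or "automatic"
      }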

  2. Training jobs are identified

    Assume that you have a microservice application made up of multiple components, such as orders, catalog, and front-end, each with its own set of logs. Because each component is likely to have different logging behavior, the log anomaly detection algorithm creates a separate model for each component. Each model is trained by a separate training job.

    The logging data for the different components is differentiated using the instance-id parameter specified during log connection mapping. For an example of log connection mapping for a LogDNA log system, see Specifying field mapping.
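
    As a minimal sketch, the per-component split can be thought of as grouping incoming log records by their instance-id value; the record layout and helper name below are assumptions for illustration:

      # Minimal sketch, not the product's implementation: split log records
      # into per-component training jobs, keyed on the instance-id field that
      # is set up during log connection mapping.
      from collections import defaultdict

      def group_by_component(log_records):
          jobs = defaultdict(list)
          for record in log_records:
              component = record["instance-id"]  # for example "orders" or "catalog"
              jobs[component].append(record)
          # One training job, and eventually one model, per component.
          return dict(jobs)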

    Steps 3 to 6 explain the process for training a unique model for each component. These steps are repeated for each component that is identified. Note: See Supported throughput rates for limits on throughput rates.

  3. Data is imported

    Based on the log data sources that are specified in the algorithm configuration, data is imported for a given component.

  4. Data is divided into 10-second slots

    The imported data is divided into 10-second time slots. For example, if two weeks' worth of data was imported, that data is divided into time slots based on the following calculation:

    2 weeks = 14 days = 336 hours = 20,160 minutes = 1,209,600 seconds = 120,960 time slots
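
    The slot arithmetic can be checked with a short sketch; assigning a log timestamp to its 10-second slot is integer division. The helper below is illustrative, not part of the product:

      # Two weeks of data at one slot per 10 seconds:
      slots = 14 * 24 * 60 * 60 // 10  # 1,209,600 seconds / 10 = 120,960 time slots

      def slot_index(epoch_seconds, training_start_epoch_seconds):
          """Return the 10-second slot that a log timestamp falls into."""
          return int(epoch_seconds - training_start_epoch_seconds) // 10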

  5. Log patterns are identified and counted

    Within each time slot, the system identifies and counts log patterns. Log patterns are identified in the following way:

    1. Each log message is separated into variant and invariant parts. For example, consider the following log messages:

       2021-03-12T03:05:31.608355778+00:00 stdout F
       {
         "name":"@instana collector",
         "__in":5,
         "hostname":"cart-846b9595c9qcqfk",
         "pid":35261,
         "module":"announceCycle/agentHostLookup",
         "level":2,
         "msg":"Agent cannot be contacted via system ABC nor via default gateway 123. Scheduling reattempt 0"
       }

       2021-03-12T03:05:33.608355778+00:00 stdout F
       {
         "name":"@instana collector",
         "__in":5,
         "hostname":"cart-846b9595c9qcqfk",
         "pid":89561,
         "module":"announceCycle/agentHostLookup",
         "level":2,
         "msg":"Agent cannot be contacted via system DEF nor via default gateway 456. Scheduling reattempt 1"
       }

       2021-03-12T03:05:35.321654987+00:00 stdout F
       {
         "name":"@instana collector",
         "__in":5,
         "hostname":"cart-846b9595c9qcqfk",
         "pid":75421,
         "module":"announceCycle/agentHostLookup",
         "level":3,
         "msg":"Agent cannot be contacted via system GHK nor via default gateway 789. Scheduling reattempt 2"
       }

      These log messages can be expressed by the following log pattern, where the variant parts of the log messages are replaced with wildcards, such as <*> and <NUM>.

       <*> stdout F
       {
         "name":"@instana collector",
         "__in":<NUM>,
         "hostname":"cart-846b9595c9qcqfk",
         "pid":<NUM>,
         "module":"announceCycle/agentHostLookup",
         "level":<NUM>,
         "msg":"Agent cannot be contacted via <*> nor via default gateway <*>. Scheduling reattempt <NUM>"
       }
      
    2. Log patterns are counted in each 10-second slot, and the mean and standard deviation of the counts are stored for each log pattern, as sketched below.
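
      A minimal sketch of this step, under simplified assumptions, follows. The masking rules, function names, and data layout are illustrative only and do not reflect the actual algorithm:

        import re
        from collections import Counter, defaultdict
        from statistics import mean, pstdev

        def to_pattern(message):
            # Crude templating for illustration: mask timestamps and numbers
            # so that the variant parts of a message collapse into wildcards.
            message = re.sub(r"\d{4}-\d{2}-\d{2}T[\d:.+]+", "<*>", message)
            message = re.sub(r"\b\d+\b", "<NUM>", message)
            return message

        def pattern_statistics(slotted_logs):
            # slotted_logs: {slot_index: [log message, ...]} for one component.
            counts_per_pattern = defaultdict(list)
            for messages in slotted_logs.values():
                slot_counts = Counter(to_pattern(m) for m in messages)
                for pattern, count in slot_counts.items():
                    counts_per_pattern[pattern].append(count)
            # Keep the mean and standard deviation of the per-slot counts for
            # each pattern (slots in which a pattern never occurs are ignored
            # in this simplification).
            return {p: (mean(c), pstdev(c)) for p, c in counts_per_pattern.items()}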

  6. Model is created

    On completion of training, a model is created that includes log pattern statistics.

    Note: Each time the algorithm is run, steps 2 to 5 are repeated from scratch and a new version of the model is generated.

Application of the model to the live log data stream

The following figure shows a simplified workflow illustrating how the natural language log anomaly detection algorithm applies the AI model to the live log data stream. This part of the workflow is known as inference, and it involves applying the findings from the model.

Workflow of the natural language log anomaly detection algorithm: applying the model to the live log data stream

  1. Model is deployed

    Depending on how the algorithm was configured, a new version of the model is deployed either automatically on completion of training, or manually.

  2. Review live stream for anomalies

    After the model is deployed, the system starts monitoring the live log stream for anomalies, which are statistically significant deviations from the baseline. Each component's logs are evaluated every 10 seconds.
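
    As a rough illustration of what a statistically significant deviation from the baseline could look like, a per-pattern check might compare each 10-second count against the stored mean and standard deviation. The function and threshold below are assumptions made for this sketch, not the algorithm's actual criterion:

      def is_significant_deviation(observed_count, baseline_mean, baseline_stddev,
                                   threshold=3.0):
          # Flag the observed 10-second count for a pattern when it is more
          # than `threshold` standard deviations away from the trained baseline.
          # The threshold value is an assumption for this sketch.
          if baseline_stddev == 0:
              return observed_count != baseline_mean
          return abs(observed_count - baseline_mean) / baseline_stddev > threshold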

  3. Log alerts are generated

    Log anomalies are generated at the end of each 10-second interval when there is a statistical difference between the observed log pattern counts and the reference values in the model, when log patterns that do not exist in the model are identified in the live log stream, or both.

    Based on this analysis, the log anomaly detection algorithm identifies one or more of the following log anomaly types:

    1. Pattern expected but not detected
    2. Pattern detected but not expected
    3. Pattern detected but not expected, and includes an error
    4. Pattern detected with lower frequency than expected
    5. Pattern detected with higher frequency than expected

    The algorithm creates a log alert for each log anomaly. The severity of the log alert is calculated based on the following considerations (see the sketch after this list):

    1. All alerts generated by this algorithm are given an initial severity of 4 (Minor).
    2. If the pattern is detected but not expected (log anomaly type 2 above), or if it is an error pattern, then the severity is increased to 5 (Major).
    3. If the pattern is detected with a lower frequency than expected (log anomaly type 4 above), then the severity is increased to 6 (Critical).
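
    The severity rules can be summarized in a short sketch. The numeric severities and anomaly type numbers follow the lists above; the function name, anomaly type labels, and inputs are illustrative assumptions:

      def alert_severity(anomaly_type, is_error_pattern=False):
          # Illustrative sketch of the severity considerations listed above,
          # not the product's implementation.
          severity = 4  # all alerts start at severity 4 (Minor)
          if anomaly_type == "detected_but_not_expected" or is_error_pattern:
              severity = 5  # raised to 5 (Major)
          if anomaly_type == "lower_frequency_than_expected":
              severity = 6  # raised to 6 (Critical)
          return severity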