Training log anomaly models for Watson AIOps AI Manager

Train a model for the log anomaly service of IBM Watson® AIOps AI Manager.

You can train log anomaly models either from the command line or from the IBM Watson AIOps console. The console provides you with the commands that you need to run to train your model.

Mapping log data for ingestion

You can map your log data to a normalized JSON format that AI Manager can use to train models. You are not limited to the default data format: as long as your log data includes an appropriate codec value, you can use the following JSON sample files as mappings.

Note: LogDNA is the default expected data format for log anomaly model training. You do not need to apply a mapping for LogDNA-based training.

Follow these steps to apply a mapping to your training data:

  1. Navigate to your training ingestion folder to place the mapping:

    cd /train/ingest_configs/log/
    
  2. Copy the appropriate sample mapping file (in this example, for Humio) to the log directory of the training folder:

    cp groupid-appid-ingest_conf.json.humio_example <app-group-id>-<app-id>-ingest_conf.json
    

    The sample mapping files are named groupid-appid-ingest_conf.json.<type>_example. Rename your copy to follow the naming convention <app-group-id>-<app-id>-ingest_conf.json, where <app-group-id> is the application group ID and <app-id> is the application ID for the model that you want to train. A worked example follows this procedure.

    Important: The ingestion configuration JSON file must follow the naming convention of <app-group-id>-<app-id>-ingest_conf.json.

  3. Proceed with training.
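
For example, for a hypothetical application group ID g1 and application ID a1, applying the Humio mapping might look like the following:

    cd /train/ingest_configs/log/
    cp groupid-appid-ingest_conf.json.humio_example g1-a1-ingest_conf.json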

Sample ingestion configuration JSON files

These sample mapping files are available in your training ingestion folder. These sample mappings override the default LogDNA mapping for log-based training.

ELK Stack

{
  "mapping": {
    "codec": "elk",
    "rolling_time": 10,
    "instance_id_field": "JOBNAME",
    "log_entity_types": "host",
    "message_field": "message",
    "timestamp_field": "@timestamp"
  }
}

Humio

{
  "mapping": {
    "codec": "humio",
    "rolling_time": 10,
    "instance_id_field": "kubernetes.container_name",
    "log_entity_types": "kubernetes.namespace_name,kubernetes.container_hash,kubernetes.host,kubernetes.container_name,kubernetes.pod_name",
    "message_field": "@rawstring",
    "timestamp_field": "@timestamp"
  }
}

Splunk

{
  "mapping": {
    "codec": "splunk",
    "rolling_time": 10,
    "instance_id_field": "sourcetype",
    "log_entity_types": "host, index, source, sourcetype",
    "message_field": "_raw",
    "timestamp_field": "_time"
  }
}

For more information about the log anomaly data schema and a sample of raw log data, see Configuring Log anomaly data.

Training and retraining from the console

  1. Log in to the IBM Watson AIOps console.

  2. In the My instances pane, in the section of the instance that you created, click View all.

  3. From the My instances page, click the options menu (three vertical dots) of the corresponding instance that contains your application group, then click Open.

  4. From your specific instance page (the name of the instance that you created), click the Name of the application group that contains the application that you want to train.

  5. From your specific application group page (the name of the application group that you created), click the Name of the application that you want to train.

  6. Click Insight models then click Configure (the pencil icon) for Logs model.

    The Run model training for log anomalies page opens. You must choose between First time training and Retraining to configure your automatically populated training steps.

First time training

  1. Copy and edit the command to log on to your cluster, replacing the values for <username> and <password>, then enter the edited command into your model training console.

  2. Copy and edit the command to select the namespace where you installed AI Manager, replacing the value for <namespace>, then enter the edited command into your model training console.

  3. Copy the exec command and enter it into your model training console. The commands for steps 1 to 3 resemble the example that follows this procedure.

  4. Set a version number for your new model and click Generate scripts.

    Note: The Set version field must meet the following requirements:

    • Characters must be lowercase letters or numbers.
    • Values cannot include \, /, *, ?, ", <, >, |, `, ~, (whitespace character), ,, #, or :.
    • Values cannot start with -, _, or +.
    • Values cannot be . or .. alone.
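
The cluster login, namespace, and exec commands that you copy in steps 1 to 3 typically resemble the following example; the cluster URL, credentials, and namespace are placeholders for your own values:

    oc login https://<cluster-api-url>:6443 -u <username> -p <password>
    oc project <namespace>
    oc exec -it $(oc get po |grep model-train-console|awk '{print $1}') bash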

After you generate the scripts, additional steps that contain scripts to run in your model training console are displayed.

Retraining

  1. Copy and edit the command to log on to your cluster, replacing the values for <username> and <password>, then enter the edited command into your model training console.

  2. Copy and edit the command to select the namespace where you installed AI Manager, replacing the value for <namespace>, then enter the edited command into your model training console.

  3. Copy the exec command and enter it into your model training console.

  4. Set a version number for your new model.

  5. Select a date range for the data that you want to retrain your model on, then click Generate scripts.

After you generate the scripts, additional steps that contain scripts to run in your model training console are displayed.

Training and retraining from the command line

Download the Watson AIOps log data quality checker (aiops-log-data-quality-checker) from the GitHub samples repository.

  1. EXEC inside the model-train-console pod:

    1. Log in to your cluster as Administrator:

       oc login https://my-ocp43.fyre.ibm.com:6443 -u kubeadmin -p <password>
      
    2. Select a project/namespace where AI Manager is installed. This example assumes installation inside the xyz project:

       oc project xyz
      
    3. EXEC inside the model-train-console pod:

       oc exec -it $(oc get po |grep model-train-console|awk '{print $1}') bash
      
  2. Train the log anomaly service:

    1. Place the input data set inside the $LOG_INGEST bucket. $LOG_INGEST, by default, initializes to log-ingest. You can access this bucket after successfully installing AI Manager. The following example shows how multiple application groups and application IDs are arranged inside the MinIO bucket:

       $LOG_INGEST
             ├── <APPLICATION-GROUP-ID-1>
             │   ├── <APPLICATION-ID-1>
             │   │   └── <VERSION_NUM>
             │   │       ├── normal-1.json.gz
             │   │       └── normal-2.json.gz
             │   └── <APPLICATION-ID-2>
             │       └── <VERSION_NUM>
             │           └── all_normal.json.gz
             └── <APPLICATION-GROUP-ID-2>
                 ├── <APPLICATION-ID-1>
                 │   └── <VERSION_NUM>
                 │       └── select_sources.json.gz
                 └── <APPLICATION-ID-2>
                     └── <VERSION_NUM>
                         └── normal_mongo.json.gz
      

      Run the data quality checker on your input data set with the following command:

       python checker.py -f TEST_LOGS_FILE >results
      

      Verify that your input data set is well formed. If the check identifies any issues, resolve them and rerun the script before training.

      If you have mounted the $LOG_INGEST bucket on model-train-console, use the following command to copy data inside the MinIO bucket.

       oc cp <APPLICATION_GROUP_ID folder> $(oc get po |grep model-train-console|awk '{print $1}'):/home/zeno/data/log_ingest/
      

      If you have not mounted the $LOG_INGEST bucket on model-train-console, you might have to use a tool such as the AWS CLI (aws-cli), as shown in the following example.
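
      For example, with the AWS CLI configured for the MinIO service, the copy might look like the following; the endpoint URL and file name are illustrative:

       # The endpoint URL and file name are placeholders for your MinIO endpoint and data file.
       aws s3 cp normal-1.json.gz s3://log-ingest/<APPLICATION-GROUP-ID-1>/<APPLICATION-ID-1>/<VERSION_NUM>/ --endpoint-url https://<minio-endpoint>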

    2. Update the parameter settings if needed. The default settings work for data sets with fewer than 1 million data points. If your data set has more data points than that, update the following environment variables in /home/zeno/train/manifests/s3fs-pvc/log_anomaly.yaml:

       env:
         - name: PVC_LOG_ID
           value: PVC_LOG_ID
         - name: ES_ENDPOINT
           value: ES_ENDPOINT
         - name: ES_USERNAME
           value: ES_USERNAME
         - name: ES_PASSWORD
           value: ES_PASSWORD
         - name: ES_CACERT_RAW
           value: ES_CACERT_RAW
         - name: NUM_TEMPLATE_TRAIN
           value: "50000000"
         - name: NUM_LOG_MATCH
           value: "1000000000"
         - name: CHUNK_SIZE
           value: "500000"
         - name: PREFIX_FINDER_ENABLED
           value: "False"
      
    3. Run the following steps inside the model-train-console pod to train the log-anomaly model:

       cd train
       python3 train_pipeline.py -p "log" -g <APPLICATION-GROUP-ID> -a <APPLICATION-ID> -v <VERSION_NUM>
      

      <APPLICATION-GROUP-ID>, <APPLICATION-ID>, and <VERSION_NUM> are required to launch training. <VERSION_NUM> is managed by you and can be any string.
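
      For example, to launch training for the application group g1, the application a1, and the version 0 that are used in the verification example later in this topic, you might run:

       python3 train_pipeline.py -p "log" -g g1 -a a1 -v 0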

      By default, MinIO automatically cleans up all of the logs and intermediate data sets; logs are stored for seven days before they are cleaned up. If you do not want to store your logs after training, use the following command to launch your training:

       python3 train_pipeline.py -p "log" -g <APPLICATION-GROUP-ID> -a <APPLICATION-ID> -v <VERSION_NUM> --enable_storage_cleanup
      

      For retraining the log-anomaly model, add the --retrain flag as in the following command:

       cd train
       python3 train_pipeline.py -p "log" -g <APPLICATION-GROUP-ID> -a <APPLICATION-ID>  -v <VERSION_NUM> --retrain '{"start_time":"YYYY-MM-DDT00","end_time":"YYYY-MM-DDT00"}'
      

      The preceding command assumes that you have kept the input data set required for retraining on the $LOG_INGEST MinIO bucket.

      Important: After you deploy your initial models, only use the retraining pipeline to train. Not doing so might result in loss of data.

      If some of your applications failed at the training step, you can launch the training job to train a single application by entering the following command:

       python3 train_pipeline.py -p "log" -g <APPLICATION-GROUP-ID> -a <APPLICATION-ID> -v <VERSION_NUM> --app <APP NAME>
      
    4. Run the following cURL command inside the model-train-console pod to verify the existence of trained models:

       curl -u $ES_USERNAME:$ES_PASSWORD -XGET https://$ES_ENDPOINT/_cat/indices  --insecure | grep <APPLICATION-GROUP-ID>-<APPLICATION-ID>-<VERSION-NUM>
      

      The following example shows the Elasticsearch indexes for the application group ID g1, the application ID a1, and the version number 0:

       curl -k -u $ES_USERNAME:$ES_PASSWORD -XGET https://$ES_ENDPOINT/_cat/indices | grep g1-a1-0
      

      The result might be as follows.

       % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                      Dload  Upload   Total   Spent    Left  Speed
       100  3038  100  3038    0     0  18987      0 --:--:-- --:--:-- --:--:-- 18987
       green open g1-a1-0-applications                    jFhrDn1yRh2v3prdxt1jaA 1 1    82 0     40.9kb  15.5kb
       green open g1-a1-0-pca_model                       EpP2ZiOGR5K8zVt-Mlk9AA 1 1    40 0      446kb 220.7kb
       green open g1-a1-0-pca_fe                          3EmxI7tNRZOzPvnjvyDAkg 1 1    40 0    144.7kb  72.3kb
       green open g1-a1-0-templates                       KVX_3hNvRwCcKNv3FX96IA 1 1   253 0    308.9kb 130.8kb
      
    5. Run the following steps inside the model-train-console pod to evaluate the trained log models. The following steps assume the presence of a raw evaluation data set and associated ground truth data set:

      Copy the raw evaluation data set to the MinIO bucket:

       aws s3 cp <RAW-LOG-FILE> s3://$EVAL_BUCKET/<APPLICATION_GROUP_ID>/<APPLICATION_ID>/<VERSION>/log/raw/
      

      Copy the ground truth data set (that is associated with the raw evaluation data set in the preceding step) to the MinIO bucket:

       aws s3 cp <GROUND-TRUTH-FILE> s3://$EVAL_BUCKET/<APPLICATION_GROUP_ID>/<APPLICATION_ID>/<VERSION>/log/groundtruth/
      

      Launch the evaluation pipeline with the following command:

       cd train
       python3 train_pipeline.py -p "log-evaluation" -g <APPLICATION_GROUP_ID> -a <APPLICATION_ID> -v <VERSION_NUM>
      

      The evaluation pipeline saves its results to the MinIO bucket. List the results with the following command:

       aws s3 ls s3://$EVAL_BUCKET/<APPLICATION_GROUP_ID>/<APPLICATION_ID>/<VERSION>/evaluation_result/
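
      For example, for the application group g1, the application a1, and the version 0, and assuming hypothetical file names eval-logs.json.gz and groundtruth.json, the end-to-end evaluation might look like the following:

       aws s3 cp eval-logs.json.gz s3://$EVAL_BUCKET/g1/a1/0/log/raw/
       aws s3 cp groundtruth.json s3://$EVAL_BUCKET/g1/a1/0/log/groundtruth/
       python3 train_pipeline.py -p "log-evaluation" -g g1 -a a1 -v 0
       aws s3 ls s3://$EVAL_BUCKET/g1/a1/0/evaluation_result/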
      
    6. Deploy the log-anomaly model if you are confident about the quality of the trained model. You can use the evaluation reports to investigate the quality of the model. Deployment is automatic when models are first onboarded, so only retrained models need explicit deployment. Run the following command inside the model-train-console pod to deploy the trained log model:

       cd train
       python3 deploy_model.py -p "log" -g <APPLICATION-GROUP-ID> -a <APPLICATION-ID> -v <VERSION_NUM>
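
      For example, to deploy the retrained model for the application group g1, the application a1, and the version 0:

       python3 deploy_model.py -p "log" -g g1 -a a1 -v 0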
      

Tuning log anomaly training parameters

Complete the following steps to tune log anomaly training parameters:

  1. Locate the folder where your training scripts are stored:

     cd /train/scripts/s3fs-pvc
    
  2. Create a directory to edit the script files in, and extract them to that directory:

     mkdir tmp
     cd tmp
     unzip ../log_anomaly.zip
    
  3. Edit log_anomaly.sh to tune your training parameters. The following parameters can be tuned:

     export TEMPLATE_TRAIN_LIMIT=2000
     export TREE_DEPTH=3
     export TEMPLATE_MODEL_THRESHOLD=0.01
     export NUM_TEMPLATE_TRAIN=100000
     export TEMPLATE_OCC_LIMIT=3
     export NUM_WORKERS=10
     export NUM_LOG_MATCH=1500000
     export CHUNK_SIZE=50000
    
  4. Repackage your edited log_anomaly.sh into the training .zip file, and delete your temporary directory:

     zip -r ../log_anomaly.zip log_anomaly.sh
     cd ..
     rm -rf tmp
    

You can proceed with training as described in the preceding sections.
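
For reference, the complete tuning sequence from the preceding steps might look like the following; the sed command is only an illustration of changing one parameter value:

    cd /train/scripts/s3fs-pvc
    mkdir tmp
    cd tmp
    unzip ../log_anomaly.zip
    # Illustrative edit only: raise NUM_TEMPLATE_TRAIN from 100000 to 500000 in log_anomaly.sh
    sed -i 's/NUM_TEMPLATE_TRAIN=100000/NUM_TEMPLATE_TRAIN=500000/' log_anomaly.sh
    zip -r ../log_anomaly.zip log_anomaly.sh
    cd ..
    rm -rf tmp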