Training Log anomaly models for Watson AIOps AI Manager
Train a model for the log anomaly service of IBM Watson® AIOps AI Manager.
You can train log anomaly models either by the command line, or using the IBM Watson AIOps console. The console provides you with the commands that you need to run to train your model.
- Mapping log data for ingestion
- Training and retraining from the console
- Training and retraining from the command line
- Tuning log anomaly training parameters
Mapping log data for ingestion
You can map your log data to a normalized JSON format that AI Manager can use to train models. You are not limited to the default data format: as long as your log data includes an appropriate codec value, you can use the following JSON sample files as mappings.
Note: LogDNA is the default expected data format for log anomaly model training. You do not need to apply a mapping on LogDNA-based training.
Follow these steps to apply a mapping to your training data:

1. Navigate to your training ingestion folder, where the mapping is placed:
cd /train/ingest_configs/log/
2. Copy the appropriate sample mapping file (in this example, for Humio) to the log directory of the training folder:
cp groupid-appid-ingest_conf.json.humio_example <app-group-id>-<app-id>-ingest_conf.json
The existing sample mapping files are named groupid-appid-ingest_conf.json.<type>_example. The target file name follows the naming convention of <app-group-id>-<app-id>-ingest_conf.json, where <app-group-id> is the application group ID and <app-id> is the application ID for the model that you want to train.
Important: The ingestion configuration JSON file must follow the naming convention of <app-group-id>-<app-id>-ingest_conf.json.
3. Proceed with training.
Sample ingestion configuration JSON files
These sample mapping files are available in your training ingestion folder. These sample mappings override the default LogDNA mapping for log-based training.
ELK Stack
{
"mapping": {
"codec": "elk",
"rolling_time": 10,
"instance_id_field": "JOBNAME",
"log_entity_types": "host",
"message_field": "message",
"timestamp_field": "@timestamp"
}
}
Humio
{
"mapping": {
"codec": "humio",
"rolling_time": 10,
"instance_id_field": "kubernetes.container_name",
"log_entity_types": "kubernetes.namespace_name,kubernetes.container_hash,kubernetes.host,kubernetes.container_name,kubernetes.pod_name",
"message_field": "@rawstring",
"timestamp_field": "@timestamp"
}
}
Splunk
{
"mapping": {
"codec": "splunk",
"rolling_time": 10,
"instance_id_field": "sourcetype",
"log_entity_types": "host, index, source, sourcetype",
"message_field": "_raw",
"timestamp_field": "_time"
}
}
For more information about the log anomaly data schema and a sample of raw log data, see Configuring Log anomaly data.
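The sample mappings above share a common shape. As an illustration only (the required-key list and the name check are inferred from the samples and the naming convention above, not from a published schema), a small Python pre-check for an ingestion configuration file might look like this:

```python
import json

# Keys present in every sample mapping above (inferred from the samples,
# not an official schema).
REQUIRED_KEYS = {"codec", "rolling_time", "instance_id_field",
                 "log_entity_types", "message_field", "timestamp_field"}

def check_ingest_conf(filename: str, raw_json: str) -> list:
    """Return a list of problems found in an ingestion configuration file."""
    problems = []
    # Naming convention: <app-group-id>-<app-id>-ingest_conf.json
    if not filename.endswith("-ingest_conf.json"):
        problems.append("file name must end in -ingest_conf.json")
    mapping = json.loads(raw_json).get("mapping", {})
    missing = REQUIRED_KEYS - set(mapping)
    if missing:
        problems.append("missing mapping keys: " + ", ".join(sorted(missing)))
    return problems

# A trimmed version of the Humio sample above:
conf = json.dumps({"mapping": {"codec": "humio", "rolling_time": 10,
                               "instance_id_field": "kubernetes.container_name",
                               "log_entity_types": "kubernetes.namespace_name",
                               "message_field": "@rawstring",
                               "timestamp_field": "@timestamp"}})
print(check_ingest_conf("g1-a1-ingest_conf.json", conf))  # []
```

Running such a check before copying the file into the training folder catches a misnamed file or a missing field early, before a training run fails.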
Training and retraining from the console
1. Log in to the IBM Watson AIOps console.
2. In the My instances pane, in the section of the instance that you created, click View all.
3. From the My instances page, click the options menu (three vertical dots) of the instance that contains your application group, then click Open.
4. From your instance page (the name of the instance that you created), click the Name of the application group that contains the application that you want to train.
5. From your application group page (the name of the application group that you created), click the Name of the application that you want to train.
6. Click Insight models, then click Configure (the pencil icon) for the Logs model.
The Run model training for log anomalies page opens. You must choose between First time training and Retraining to configure your automatically populated training steps.
First time training
1. Copy and edit the command to log in to your cluster, replacing the values for <username> and <password>, then enter the edited command into your model training console.
2. Copy and edit the command to select the namespace where you installed AI Manager, replacing the value for <namespace>, then enter the edited command into your model training console.
3. Copy and enter the exec command into your model training console.
4. Set a version number for your new model and click Generate scripts.
Note: The Set version field must meet the following requirements:
- Characters must be lowercase or numeric.
- Values cannot include any of the following characters: \ / * ? " < > | ` ~ , # : or a whitespace character.
- Values cannot start with -, _, or +.
- Value cannot be . or .. alone.
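The version rules above can be sketched as a quick validation helper. This is illustrative only (the console performs its own validation), and the lowercase check is a rough reading of "lowercase or numeric":

```python
# Characters that the Set version field rejects (from the rules above),
# including space and tab as whitespace.
FORBIDDEN = set('\\/*?"<>|`~,#: \t')

def is_valid_version(version: str) -> bool:
    """Check a model version string against the console's stated rules."""
    if version in (".", "..") or not version:
        return False                      # cannot be . or .. alone, or empty
    if version[0] in "-_+":
        return False                      # cannot start with -, _, or +
    if any(ch in FORBIDDEN for ch in version):
        return False                      # no forbidden characters or whitespace
    return version == version.lower()     # rough check: no uppercase characters

print(is_valid_version("1-0-2"))   # True
print(is_valid_version("-beta"))   # False
print(is_valid_version("V2"))      # False
```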
You can see additional steps that contain scripts to run in your model training console. The additional steps are:
- Prepare your training data set, and save it to the selected data bucket, following the structure noted.
- Access the selected training folder.
- Run your training pipeline.
- Verify your models (optional).
- Evaluate your models (optional).
Retraining
1. Copy and edit the command to log in to your cluster, replacing the values for <username> and <password>, then enter the edited command into your model training console.
2. Copy and edit the command to select the namespace where you installed AI Manager, replacing the value for <namespace>, then enter the edited command into your model training console.
3. Copy and enter the exec command into your model training console.
4. Set a version number for your new model.
5. Select a date range for the data that you want to retrain your model on, then click Generate scripts.
You can see additional steps that contain scripts to run in your model training console. The additional steps are:
- Prepare your training data set, and save it to the selected data bucket, following the structure noted.
- Access the selected training folder.
- Run your training pipeline.
- Verify your models (optional).
- Evaluate models (optional).
- Deploy models (optional).
Training and retraining from the command line
Download the Watson AIOps log data quality checker (aiops-log-data-quality-checker) from the GitHub samples repository.
- exec inside the model-train-console pod:
  1. Log in to your cluster as Administrator:
  oc login https://my-ocp43.fyre.ibm.com:6443 -u kubeadmin -p <password>
  2. Select the project (namespace) where AI Manager is installed. This example assumes installation inside the xyz project:
  oc project xyz
  3. exec inside the model-train-console pod:
  oc exec -it $(oc get po |grep model-train-console|awk '{print $1}') bash
- Train the log anomaly service:
  1. Place the input data set inside the $LOG_INGEST bucket. $LOG_INGEST, by default, initializes to log-ingest. You can access this bucket after successfully installing AI Manager. The following figure illustrates how multiple application groups and application IDs are arranged inside the MinIO bucket:

$LOG_INGEST
├── <APPLICATION-GROUP-ID-1>
│   ├── <APPLICATION-ID-1>
│   │   └── <VERSION_NUM>
│   │       ├── normal-1.json.gz
│   │       └── normal-2.json.gz
│   └── <APPLICATION-ID-2>
│       └── <VERSION_NUM>
│           └── all_normal.json.gz
└── <APPLICATION-GROUP-ID-2>
    ├── <APPLICATION-ID-1>
    │   └── <VERSION_NUM>
    │       └── select_sources.json.gz
    └── <APPLICATION-ID-2>
        └── <VERSION_NUM>
            └── normal_mongo.json.gz

  2. Run the data quality checker on your input data set with the following command:
  python checker.py -f TEST_LOGS_FILE > results
  Verify that your input data set is well formed. If the check identifies any issues, resolve them and rerun the script before training.
  3. If you have mounted the $LOG_INGEST bucket on model-train-console, use the following command to copy data into the MinIO bucket:
  oc cp <APPLICATION_GROUP_ID folder> $(oc get po |grep model-train-console|awk '{print $1}'):/home/zeno/data/log_ingest/
  If you have not mounted the $LOG_INGEST bucket on model-train-console, you might have to use a tool such as aws-cli.
- Update the parameter settings if needed. The default settings work for data sets with fewer than 1 million data points. If your data set has more data points than that, update the following env in /home/zeno/train/manifests/s3fs-pvc/log_anomaly.yaml:

env:
  - name: PVC_LOG_ID
    value: PVC_LOG_ID
  - name: ES_ENDPOINT
    value: ES_ENDPOINT
  - name: ES_USERNAME
    value: ES_USERNAME
  - name: ES_PASSWORD
    value: ES_PASSWORD
  - name: ES_CACERT_RAW
    value: ES_CACERT_RAW
  - name: NUM_TEMPLATE_TRAIN
    value: "50000000"
  - name: NUM_LOG_MATCH
    value: "1000000000"
  - name: CHUNK_SIZE
    value: "500000"
  - name: PREFIX_FINDER_ENABLED
    value: "False"
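To decide whether the defaults apply, you need an approximate count of the data points in your input. Assuming the input is gzipped JSON lines with one log record per line (an assumption about your data, consistent with the .json.gz files shown earlier), a rough counter might be:

```python
import gzip
import os

DEFAULT_LIMIT = 1_000_000  # defaults are sized for data sets below ~1M data points

def count_data_points(paths):
    """Count total non-empty lines (log records) across gzipped JSON-lines files."""
    total = 0
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            total += sum(1 for line in fh if line.strip())
    return total

# Example: write a tiny gzipped file and count its records.
with gzip.open("sample.json.gz", "wt", encoding="utf-8") as fh:
    fh.write('{"message": "a"}\n{"message": "b"}\n')
n = count_data_points(["sample.json.gz"])
print(n, "update log_anomaly.yaml" if n > DEFAULT_LIMIT else "defaults are fine")
os.remove("sample.json.gz")
```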
- Run the following steps inside the model-train-console pod to train the log-anomaly model:
cd train
python3 train_pipeline.py -p "log" -g <APPLICATION-GROUP-ID> -a <APPLICATION-ID> -v <VERSION_NUM>
<APPLICATION_GROUP_ID>, <APPLICATION_ID>, and <VERSION_NUM> are required to launch training. <VERSION_NUM> is managed by you and can be any string.
By default, MinIO cleans up all of the logs and intermediate data sets; logs are automatically stored for seven days. If you do not want to store your logs after training, use the following command to launch your training:
python3 train_pipeline.py -p "log" -g <APPLICATION-GROUP-ID> -a <APPLICATION-ID> -v <VERSION_NUM> --enable_storage_cleanup
To retrain the log-anomaly model, add the --retrain flag as in the following command:
cd train
python3 train_pipeline.py -p "log" -g <APPLICATION-GROUP-ID> -a <APPLICATION-ID> -v <VERSION_NUM> --retrain '{"start_time":"YYYY-MM-DDT00","end_time":"YYYY-MM-DDT00"}'
The preceding command assumes that you have kept the input data set required for retraining in the $LOG_INGEST MinIO bucket.
Important: After you deploy your initial models, use only the retraining pipeline to train. Not doing so might result in loss of data.
If some of your applications failed at the training step, you can launch the training job for a single application by entering the following command:
python3 train_pipeline.py -p "log" -g <APPLICATION-GROUP-ID> -a <APPLICATION-ID> -v <VERSION_NUM> --app <APP NAME>
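The --retrain value is a JSON string with start_time and end_time in YYYY-MM-DDTHH form. A small helper to build it from Python dates might look like this (illustrative; any tool that emits the same JSON works):

```python
import json
from datetime import date

def retrain_window(start: date, end: date, hour: int = 0) -> str:
    """Build the JSON value for the --retrain flag (YYYY-MM-DDTHH timestamps)."""
    if end < start:
        raise ValueError("end_time must not precede start_time")
    fmt = "{:%Y-%m-%d}T{:02d}"
    return json.dumps({"start_time": fmt.format(start, hour),
                       "end_time": fmt.format(end, hour)})

print(retrain_window(date(2021, 3, 1), date(2021, 3, 8)))
# {"start_time": "2021-03-01T00", "end_time": "2021-03-08T00"}
```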
- Run the following cURL command inside the model-train-console pod to verify the existence of trained models:
curl -u $ES_USERNAME:$ES_PASSWORD -XGET https://$ES_ENDPOINT/_cat/indices --insecure | grep <APPLICATION-GROUP-ID>-<APPLICATION-ID>-<VERSION-NUM>
The following example shows the Elasticsearch indexes for the application_group_id g1, the application_id a1, and the version_num 0:
curl -k -u $ES_USERNAME:$ES_PASSWORD -XGET https://$ES_ENDPOINT/_cat/indices | grep g1-a1-0
The result might be as follows:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3038  100  3038    0     0  18987      0 --:--:-- --:--:-- --:--:-- 18987
green open g1-a1-0-applications jFhrDn1yRh2v3prdxt1jaA 1 1  82 0  40.9kb  15.5kb
green open g1-a1-0-pca_model    EpP2ZiOGR5K8zVt-Mlk9AA 1 1  40 0   446kb 220.7kb
green open g1-a1-0-pca_fe       3EmxI7tNRZOzPvnjvyDAkg 1 1  40 0 144.7kb  72.3kb
green open g1-a1-0-templates    KVX_3hNvRwCcKNv3FX96IA 1 1 253 0 308.9kb 130.8kb
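A quick way to confirm that all four model indexes exist is to parse the _cat/indices output. The expected suffixes below are taken from the sample result above; the parsing itself is an illustration, not a product check:

```python
# Index name suffixes seen in the sample _cat/indices output above.
EXPECTED_SUFFIXES = {"applications", "pca_model", "pca_fe", "templates"}

def missing_indexes(cat_output: str, prefix: str) -> set:
    """Return expected suffixes with no matching index in _cat/indices output."""
    found = set()
    for line in cat_output.splitlines():
        fields = line.split()
        # _cat/indices rows look like: health status index uuid pri rep ...
        if len(fields) >= 3 and fields[2].startswith(prefix + "-"):
            found.add(fields[2][len(prefix) + 1:])
    return EXPECTED_SUFFIXES - found

sample = """\
green open g1-a1-0-applications jFhrDn1yRh2v3prdxt1jaA 1 1 82 0 40.9kb 15.5kb
green open g1-a1-0-pca_model EpP2ZiOGR5K8zVt-Mlk9AA 1 1 40 0 446kb 220.7kb
green open g1-a1-0-pca_fe 3EmxI7tNRZOzPvnjvyDAkg 1 1 40 0 144.7kb 72.3kb
green open g1-a1-0-templates KVX_3hNvRwCcKNv3FX96IA 1 1 253 0 308.9kb 130.8kb"""
print(missing_indexes(sample, "g1-a1-0"))  # set()
```

An empty set means every expected index is present for that application and version.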
- Run the following steps inside the model-train-console pod to evaluate the trained log models. The following steps assume the presence of a raw evaluation data set and an associated ground truth data set:
  1. Copy the raw evaluation data set to the MinIO bucket:
  aws s3 cp <RAW-LOG-FILE> s3://$EVAL_BUCKET/<APPLICATION_GROUP_ID>/<APPLICATION_ID>/<VERSION>/log/raw/
  2. Copy the ground truth data set (that is associated with the raw evaluation data set in the preceding step) to the MinIO bucket:
  aws s3 cp <GROUND-TRUTH-FILE> s3://$EVAL_BUCKET/<APPLICATION_GROUP_ID>/<APPLICATION_ID>/<VERSION>/log/groundtruth/
  3. Launch the evaluation pipeline with the following commands:
  cd train
  python3 train_pipeline.py -p "log-evaluation" -g <APPLICATION_GROUP_ID> -a <APPLICATION_ID> -v <VERSION_NUM>
  4. List the evaluation pipeline results in the MinIO bucket:
  aws s3 ls s3://$EVAL_BUCKET/<APPLICATION_GROUP_ID>/<APPLICATION_ID>/<VERSION>/evaluation_result/
- Deploy the log-anomaly model if you are confident about the quality of the trained model. You can use the evaluation reports to investigate the quality of the model. Deployment is automatic when models are onboarded, so only retrained models need explicit deployment. Run the following commands inside the model-train-console pod to deploy the trained log model:
cd train
python3 deploy_model.py -p "log" -g <APPLICATION-GROUP-ID> -a <APPLICATION-ID> -v <VERSION_NUM>
Tuning log anomaly training parameters
Complete the following steps to tune log anomaly training parameters:
1. Locate the folder where your training scripts are stored:
cd /train/scripts/s3fs-pvc
2. Create a directory to edit the script files in, and extract them to that directory:
mkdir tmp
cd tmp
unzip ../log_anomaly.zip
3. Edit log_anomaly.sh to tune your training parameters. The following parameters can be tuned:
export TEMPLATE_TRAIN_LIMIT=2000
export TREE_DEPTH=3
export TEMPLATE_MODEL_THRESHOLD=0.01
export NUM_TEMPLATE_TRAIN=100000
export TEMPLATE_OCC_LIMIT=3
export NUM_WORKERS=10
export NUM_LOG_MATCH=1500000
export CHUNK_SIZE=50000
4. Create a .zip file to replace the existing training .zip file and delete your temporary directory:
zip -r ../log_anomaly.zip log_anomaly.sh
cd ..
rm -rf tmp
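If you tune these parameters often, editing the export lines can be scripted instead of done by hand. A sketch, assuming the simple export KEY=VALUE format shown above, might be:

```python
import re

def set_exports(script_text: str, overrides: dict) -> str:
    """Rewrite 'export KEY=VALUE' lines in a shell script with new values."""
    lines = []
    for line in script_text.splitlines():
        match = re.match(r"^export\s+(\w+)=", line)
        if match and match.group(1) in overrides:
            line = "export {}={}".format(match.group(1), overrides[match.group(1)])
        lines.append(line)
    return "\n".join(lines)

script = "export TREE_DEPTH=3\nexport NUM_WORKERS=10"
print(set_exports(script, {"NUM_WORKERS": 20}))
# export TREE_DEPTH=3
# export NUM_WORKERS=20
```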
You can proceed with training as described in the preceding sections.