Watson™ AIOps AI Manager Prerequisites and requirements

There are prerequisites that must be satisfied to successfully install IBM Watson® AIOps AI Manager.

Prerequisites

Before installing AI Manager, you must install:

  • Red Hat OpenShift 4.3 or 4.5
  • IBM® Cloud Pak for Data 3.0.1
  • Kafka/Strimzi
    1. Log in to your OpenShift console
    2. Click Operators > Operator Hub
    3. Filter by Strimzi - select Strimzi and click Install
    4. Ensure the All namespaces on the cluster (default) option is checked. Leave other default options unchanged.
    5. Click Subscribe. A short verification sketch follows this list.
  • IBM Netcool® Agile Service Manager 1.1.9
    Note: AI Manager bundles Netcool Agile Service Manager to use in the proper merging and tagging of uploaded topologies. To ensure proper sizing of Netcool Agile Service Manager for AI Manager, see the OCP sizing reference topic in the Netcool Agile Service Manager documentation.
  • AI Manager requires data from at least one supported data repository. For more information, see Supported data repository data schemas.
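After the Strimzi subscription is created, you can optionally confirm that the operator installed correctly. This is a minimal sketch that assumes the default installation into the openshift-operators namespace (the All namespaces option); adjust the namespace if you chose a different installation mode.
  oc get csv -n openshift-operators | grep -i strimzi
  oc get pods -n openshift-operators | grep -i strimzi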

Limitations

  • Multi-instance: AI Manager supports deploying multiple service instances in a single IBM Cloud Pak for Data deployment; however, because of a limitation in Strimzi, this must be done by using tethered projects. If you want multiple service instances, install each instance, together with its own copy of Strimzi, into its own tethered project.

System requirements

Small instance

A small instance of AI Manager is the minimum supported size and supports a single tenant without High Availability. High Availability is not applicable for training: if a training job fails, it restarts without any impact on the CPU. The data stores are set up with High Availability by default. A small instance requires the following:

CPU and memory requirements
Component CPU (cores) Memory (GB)
AI Manager inference 11.8 26.00
AI Manager training* 2.9 6.50
Strimzi Kafka** 8.0 13.00
Elastic 9.0 9.00
Minio 1.0 8.00
PostgreSQL 0.8 2.00
Model training 0.6 2.25
Totals 34.1 66.75
*AI Manager training uses a queue, so at any particular time, this is the maximum amount of resources used.

** Prerequisite, and not part of licensed VPC count. CPU resource requirements for Strimzi are tied to OpenShift licenses.

AI Manager inference consists of:
Component Pod Replicas
Log Anomaly Pipeline ibm-aiops----ibm-flink-task-manager 1
Log Anomaly Pipeline ibm-aiops----aio-log-anomaly-detector 1
Event Grouping Pipeline ibm-aiops----aio-event-grouping 1
Event Grouping Pipeline ibm-aiops----aio-alert-localization 1
Event Grouping Pipeline ibm-aiops----aio-topology 1
ChatOps ibm-aiops----aio-chatops-orchestrator 1
ChatOps ibm-aiops----aio-chatops-slack-integrator 1
Incident Similarity Pipeline ibm-aiops----aio-similar-incidents-service 1
AI Manager Platform ibm-aiops----aio-controller 1
AI Manager Platform ibm-aiops----aio-flink-zookeeper 3
AI Manager Platform ibm-aiops----ibm-flink-job-manager 2
AI Manager Platform ibm-aiops----aio-mock-server 1
AI Manager Platform ibm-aiops----aio-persistence 1
AI Manager Platform ibm-aiops----aio-core-tests 1
AI Manager Platform ibm-aiops----aio-addon 1
AI Manager Platform ibm-aiops----aio-model-train-console 1
AI Manager Training 1
Elastic ibm-aiops---ib-d0d-* 3
Elastic ibm-aiops----elasticsearch-test-* 1
Minio ibm-aiops----minio-* 3
Minio ibm-aiops----minio-test-* 1
PostgreSQL ibm-aiops----postgres-keeper-* 3
PostgreSQL ibm-aiops----postgres-proxy-* 2
PostgreSQL ibm-aiops---op-55c0-sentinel-* 3
Model train ibm-aiops----ibm-dlaas-lcm-* 2
Model train ibm-aiops----ibm-dlaas-ratelimiter-* 1
Model train ibm-aiops----ibm-dlaas-trainer-v2-* 2
Model train ibm-object-storage-plugin 1
Model train ibmcloud-object-storage-driver 1 per node
A small instance supports the following volume:
Pipeline Component Number of items Size
Logs 4,000 msg/sec 454 Bytes/msg
Netcool® Operations Insight Event groups 5 events/sec
PagerDuty alerts 5 alerts/sec  
Incident queries 45 queries/min  
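For reference, at 4,000 messages per second and 454 bytes per message, the log pipeline ingests roughly 1.8 MB/s, or on the order of 150 GB of uncompressed log data per day.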
Medium instance

A medium deployment uses approximately twice the resources of a small instance, which enables High Availability or increased throughput. A medium instance requires the following:

CPU and memory requirements
Component CPU (cores) Memory (GB)
AI Manager inference 18.4 40.00
AI Manager training* 2.9 6.50
Strimzi Kafka** 8.0 13.00
Elastic 9.0 9.00
Minio 1.0 8.00
PostgreSQL 0.8 2.00
Model training 0.6 2.25
Totals 40.7 80.75
*AI Manager training uses a queue, so at any particular time, this is the maximum amount of resources used.

** Prerequisite and not part of licensed VPC count. CPU resource requirements for Strimzi are tied to OpenShift licenses.

AI Manager inference consists of:
Component Pod Replicas
Log Anomaly Pipeline ibm-aiops----ibm-flink-task-manager 2
Log Anomaly Pipeline ibm-aiops----aio-log-anomaly-detector 2
Event Grouping Pipeline ibm-aiops----aio-event-grouping 2
Event Grouping Pipeline ibm-aiops----aio-alert-localization 2
Event Grouping Pipeline ibm-aiops----aio-topology 2
ChatOps ibm-aiops----aio-chatops-orchestrator 2
ChatOps ibm-aiops----aio-chatops-slack-integrator 2
Incident Similarity Pipeline ibm-aiops----aio-similar-incidents-service 2
AI Manager Platform ibm-aiops----aio-controller 2
AI Manager Platform ibm-aiops----aio-flink-zookeeper 3
AI Manager Platform ibm-aiops----ibm-flink-job-manager 2
AI Manager Platform ibm-aiops----aio-mock-server 1
AI Manager Platform ibm-aiops----aio-persistence 2
AI Manager Platform ibm-aiops----aio-core-tests 1
AI Manager Platform ibm-aiops----aio-addon 2
AI Manager Platform ibm-aiops----aio-model-train-console 1
AI Manager Training 1
Elastic ibm-aiops---ib-d0d-* 3
Elastic ibm-aiops----elasticsearch-test-* 1
Minio ibm-aiops----minio-* 3
Minio ibm-aiops----minio-test-* 1
PostgreSQL ibm-aiops----postgres-keeper-* 3
PostgreSQL ibm-aiops----postgres-proxy-* 2
PostgreSQL ibm-aiops---op-55c0-sentinel-* 3
Model train ibm-aiops----ibm-dlaas-lcm-* 2
Model train ibm-aiops----ibm-dlaas-ratelimiter-* 1
Model train ibm-aiops----ibm-dlaas-trainer-v2-* 2
Model train ibm-object-storage-plugin 1
Model train ibmcloud-object-storage-driver 1 per node
A medium instance supports the following volume:
Pipeline Component Number of items Size
Logs 8,000 msg/sec  
Netcool Operations Insight Event groups 10 events/sec
PagerDuty alerts 10 alerts/sec  
Incident queries 85 queries/min  

Scaling the services

The highest volume of data is processed by the Data Ingest tasks on log data (running on Apache Flink). The performance of these tasks depends heavily on how well the configuration is tuned for the data being processed. Three pieces of information are important:
  • Number of applications in the log data: Each application maps to a microservice. This also corresponds to the number of models that are created during log model training.
  • Number of Kafka topic partitions: The number of partitions allows consumers and producers to read and write in parallel.
  • Flink parallelism: This number allows Flink to run a particular operator (a step in the task) in parallel, potentially across multiple pods.
For best performance:
Number of apps (in log data) == Number of Kafka Topic Partitions == Number of Parallelism
For medium performance:
Number of apps (in log data) < Number of Kafka Topic Partitions == Number of Parallelism
And for the worst performance:
Number of apps (in log data) > Number of Kafka Topic Partitions < Number of Parallelism
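As an illustration of keeping these numbers aligned, a Kafka topic with a matching partition count can be declared through the Strimzi KafkaTopic custom resource. The following is a minimal sketch: the topic name, namespace, cluster label, and partition count (assuming eight applications in the log data) are placeholders to adapt to your environment.
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaTopic
metadata:
  name: aiops-logs                   # illustrative topic name
  namespace: kafka                   # namespace where your Kafka cluster runs
  labels:
    strimzi.io/cluster: my-cluster   # name of your Strimzi Kafka cluster
spec:
  partitions: 8                      # match the number of applications in the log data
  replicas: 3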
Each instance of Flink (task manager) has 16 slots, each with 128 MB of memory. Administrators can use the Parallelism configuration field to increase the number of instances that run in parallel. Each of these uses a slot; the number of slots that are required is the sum of all the parallelism configuration entries (across all connections). Because each replica has 16 slots, if more slots are required, increase the number of replicas to match.
Note: Adding extra slots beyond the numbers that are specified in the parallel parameter will not increase performance, because the slots remain idle.
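As an example, the following minimal sketch sizes the task manager replicas, assuming a total of 24 parallelism entries across all connections (a placeholder value) and the same label selector that the scaling script below uses; substitute your own namespace and instance name.
# Total of the Parallelism values across all connections (placeholder value)
TOTAL_PARALLELISM=24
SLOTS_PER_TASK_MANAGER=16

# Round up: replicas = ceil(total parallelism / slots per task manager)
REPLICAS=$(( (TOTAL_PARALLELISM + SLOTS_PER_TASK_MANAGER - 1) / SLOTS_PER_TASK_MANAGER ))

# Scale the Flink task manager deployment by label selector
oc scale deployment --replicas=$REPLICAS -n <namespace> \
  -l "app.kubernetes.io/component=task-manager,app.kubernetes.io/instance=ibm-aiops---<instance-name>"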
To scale to a small or medium instance, create and run a scaling script:
  1. Run oc login
  2. Open a shell script editor, copy the following script, and paste it into the editor:
    #!/bin/bash
    
    set -e
    
    USAGE="$(
      cat <<EOF
    usage: $0
      [-i | --instance]         Instance name of the aiops installation
      [-n | --namespace]        Namespace where aiops instance is installed
      [-c | --config]           Config of AIOps, can be small medium large
    EOF
    )"
    
    usage() {
      echo "$USAGE" >&2
      exit 1
    }
    
    
    
    CURR_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"
    
    
    while (("$#")); do
      arg="$1"
      case $arg in
      -n | --namespace)
        if [ -n "$2" ] && [ "${2:0:1}" != "-" ]; then
          NAMESPACE=$2
          shift 2
        else
          echo "Error: Argument for $1 is missing" >&2
          exit 1
        fi
        ;;
      -i | --instance)
        if [ -n "$2" ] && [ "${2:0:1}" != "-" ]; then
          INSTANCE_NAME=$2
          shift 2
        else
          echo "Error: Argument for $1 is missing" >&2
          exit 1
        fi
        ;;
      -c | --config)
        if [ -n "$2" ] && [ "${2:0:1}" != "-" ]; then
          CONFIG=$2
          shift 2
        else
          echo "Error: Argument for $1 is missing" >&2
          exit 1
        fi
        ;;
      -h | --help)
        usage
        ;;
      -* | --*=) # unsupported flags
        echo "Error: Unsupported flag $1" >&2
        exit 1
        ;;
      *) # skip unrecognized positional arguments to avoid an endless loop
        shift
        ;;
      esac
    done
    
    check_env_vars() {
        echo "========== Begin test for environment variables =========="
    
        if [ -z "${NAMESPACE}" ];
        then
            echo ">> NAMESPACE var not set... The -n or --namespace argument is required"
            exit 1
        fi
    
        if [ -z "${INSTANCE_NAME}" ];
        then
            echo ">>  INSTANCE_NAME var not set... The -i or --instance argument is required"
            exit 1
        fi
    
        if [ -z "${CONFIG}" ];
        then
            echo ">> CONFIG var not set... The -c or --config argument is required"
            exit 1
        fi
    
       
    }
    
    check_env_vars
    
    RELEASE="ibm-aiops---${INSTANCE_NAME}"
    component_label="component"
    
    component_label_v2="app.kubernetes.io/component"
    
    function scaleReplicas()
    {
        echo "scaling $2 with label $3=$4 with replicas $1"
        oc scale --replicas=$1 $2 -n $NAMESPACE -l "${3}=${4},app.kubernetes.io/instance=${RELEASE}" 
    }
    
    function main()
    {
        #ai4it apps
        addon_component_name="addon"
        alert_localization_component_name="alert-localization"
        chatops_orchestrator_component_name="chatops-orchestrator"
        chatops_slack_integrator_component_name="chatops-slack-integrator"
        controller_component_name="controller"
        event_grouping_component_name="event-grouping"
        log_anomaly_detector_component_name="log-anomaly-detector"
        mock_server_component_name="mock-server"
        model_train_console_component_name="model-train-console"
        persistence_component_name="persistence"
        similar_incidents_service_component_name="similar-incidents-service"
        topology_component_name="topology"
        qes_component_name="qes"
    
        #postgres Apps
        postgres_sentinel_component_name="stolon-sentinel"
        postgres_proxy_component_name="stolon-proxy"
        postgres_component_keeper="stolon-keeper"
    
        #dlaas
        lcm_component_name="lcm"
        ratelimiter_component_name="ratelimiter"
        trainer_v2_component_name="trainer-v2"
    
        #flink
        job_manager_component_name="job-manager"
        task_manager_component_name="task-manager"
    
        #gateway
        gw_component_component_name="gw-deployment"
    
        #minio
        minio_server_component_name="server"
    
        #elasticsearch
        elasticsearch_server_component_name="elasticsearch-server"
    
    
        if [ $CONFIG == 'small' ]; then
            echo "small config set"
            #scale ai4it-icp chart replicas
            scaleReplicas 1 deployment $component_label_v2 $addon_component_name
            scaleReplicas 1 deployment $component_label_v2 $alert_localization_component_name
            scaleReplicas 1 deployment $component_label_v2 $chatops_orchestrator_component_name
            scaleReplicas 3 deployment $component_label_v2 $chatops_slack_integrator_component_name
            scaleReplicas 1 deployment $component_label_v2 $controller_component_name
            scaleReplicas 1 deployment $component_label_v2 $event_grouping_component_name
            scaleReplicas 1 deployment $component_label_v2 $log_anomaly_detector_component_name
            scaleReplicas 1 deployment $component_label_v2 $mock_server_component_name
            scaleReplicas 1 deployment $component_label_v2 $model_train_console_component_name
            scaleReplicas 1 deployment $component_label_v2 $persistence_component_name
            scaleReplicas 1 deployment $component_label_v2 $similar_incidents_service_component_name
            scaleReplicas 1 deployment $component_label_v2 $topology_component_name
            scaleReplicas 1 statefulset $component_label_v2 $qes_component_name
    
    
            #scale postgres subchart replicas
            scaleReplicas 2 deployment $component_label $postgres_proxy_component_name
            scaleReplicas 3 statefulset $component_label $postgres_component_keeper
            scaleReplicas 3 deployment $component_label $postgres_sentinel_component_name
    
            #scale dlaas subchart replicas
            scaleReplicas 2 deployment $component_label_v2 $lcm_component_name
            scaleReplicas 1 deployment $component_label_v2 $ratelimiter_component_name
            scaleReplicas 2 deployment $component_label_v2 $trainer_v2_component_name
    
            #scale flink subchart replicas
            scaleReplicas 1 deployment $component_label_v2 $job_manager_component_name
            scaleReplicas 1 deployment $component_label_v2 $task_manager_component_name
    
            #scale gateway subchart replicas
            scaleReplicas 1 deployment $component_label_v2 $gw_component_component_name
    
            #scale minio subchart replicas
            scaleReplicas 4 statefulset $component_label_v2 $minio_server_component_name
    
            #scale elasticsearch subchart replicas
            scaleReplicas 1 statefulset $component_label_v2 $elasticsearch_server_component_name
    
        fi
    
        if [ $CONFIG == 'medium' ]; then
            echo "medium config set"
            #scale ai4it-icp chart replicas
            scaleReplicas 1 deployment $component_label_v2 $addon_component_name
            scaleReplicas 2 deployment $component_label_v2 $alert_localization_component_name
            scaleReplicas 2 deployment $component_label_v2 $chatops_orchestrator_component_name
            scaleReplicas 6 deployment $component_label_v2 $chatops_slack_integrator_component_name
            scaleReplicas 1 deployment $component_label_v2 $controller_component_name
            scaleReplicas 2 deployment $component_label_v2 $event_grouping_component_name
            scaleReplicas 2 deployment $component_label_v2 $log_anomaly_detector_component_name
            scaleReplicas 1 deployment $component_label_v2 $mock_server_component_name
            scaleReplicas 1 deployment $component_label_v2 $model_train_console_component_name
            scaleReplicas 2 deployment $component_label_v2 $persistence_component_name
            scaleReplicas 2 deployment $component_label_v2 $similar_incidents_service_component_name
            scaleReplicas 2 deployment $component_label_v2 $topology_component_name
            scaleReplicas 1 statefulset $component_label_v2 $qes_component_name
    
    
            #scale postgres subchart replicas
            scaleReplicas 2 deployment $component_label $postgres_proxy_component_name
            scaleReplicas 3 statefulset $component_label $postgres_component_keeper
            scaleReplicas 3 deployment $component_label $postgres_sentinel_component_name
    
            #scale dlaas subchart replicas
            scaleReplicas 2 deployment $component_label_v2 $lcm_component_name
            scaleReplicas 1 deployment $component_label_v2 $ratelimiter_component_name
            scaleReplicas 2 deployment $component_label_v2 $trainer_v2_component_name
    
            #scale flink subchart replicas
            scaleReplicas 1 deployment $component_label_v2 $job_manager_component_name
            scaleReplicas 2 deployment $component_label_v2 $task_manager_component_name
    
            #scale gateway subchart replicas
            scaleReplicas 1 deployment $component_label_v2 $gw_component_component_name
    
            #scale minio subchart replicas
            scaleReplicas 4 statefulset $component_label_v2 $minio_server_component_name
    
            #scale elasticsearch subchart replicas
            scaleReplicas 1 statefulset $component_label_v2 $elasticsearch_server_component_name
        fi
    
    }
    
    
    main
  3. Save the script as scale.sh and make it executable (for example, chmod +x scale.sh)
  4. Run the script as needed:
    ./scale.sh -i <instance-name> -c <config: "small" or "medium"> -n <namespace where aiops is installed>
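    For example, to scale an instance named aiops (a placeholder name) that is installed in the zen project to the medium configuration:
    ./scale.sh -i aiops -c medium -n zen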

Storage requirements

The service uses datastores and other resources to support its functions. These resources are stored in Portworx storage. The following list describes the storage resources that are required to support the persistent volumes that are used by the service.
  • Local - Local storage (not host-local storage) is storage that is allocated to the container while it runs. There is generally not much of it, and it is temporary: when the container terminates, so does this storage. AI Manager uses only a few MB of local storage for temporary files.
  • Persistent Volume Claims (PVC) - A PVC can mount several types of storage into a container, such as Portworx. AI Manager uses PVCs during training and quality evaluations. In both cases, the amount used depends on the size of the training and validation data sets. During training, the PVC holds a copy of the decompressed training data set, which can be several TB for logs; events and alerts are typically a few hundred MB, and incidents and entity mappings are smaller still, at tens of MB. For quality evaluation, a copy of the evaluation data set is kept in the PVC; this again depends on the size of the data set used, but is typically around 1 GB. Evaluate the data sets to be used and allocate the appropriate amount of storage.
  • PostgreSQL - PostgreSQL stores the operational data for AI Manager, such as stories and anomalies. The data for each story (including all associated data) is small, only a few hundred KB, but it grows as each new story is detected. Include enough storage for the anticipated volume of stories combined with the retention period.
  • Elastic - Elasticsearch stores the training artifacts for both log anomalies and incidents. Both are fairly small, typically tens of MB per version. They grow with each retraining, so while not much storage is required, consider the number of versions being kept.
  • Minio - Minio is used for long-term file storage, as well as temporary storage between training jobs. It is mostly used for training, but unlike the PVC, it is long-term storage. The amount used again depends heavily on the size of the training data sets. For logs, this consists of 1x the compressed training data set (typically several hundred GB) plus 0.5x the uncompressed training data set (hundreds of GB or even a few TB). Event training uses 2x the event data set (typically a few hundred MB), and the remaining uses are negligible (several MB). For each retraining, the amount used increases by 0.5x the uncompressed log training data set plus 1x the event data set.
Given these details, storage requirements are:
  • Small instance: 200 GB of storage. This can handle inference as well as training of approximately 5 GB of compressed (50 GB uncompressed) data.
  • Medium instance: 1.5 TB of storage. This can handle inference as well as training of approximately 100 GB of compressed (1 TB uncompressed) data.
Note that the storage requirements depend heavily on the size of the training data sets.
Important: AI Manager uses ReadWriteOnce for storage, except in cases where storage is required to be shared by multiple pods, in which case S3FS is used to provide ReadWriteMany.
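As an illustration of the ReadWriteOnce pattern, a persistent volume claim for training data might look like the following sketch; the claim name and requested size are placeholders, and the storage class refers to the portworx-aiops class defined in the next section.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: aiops-training-data        # illustrative name
spec:
  accessModes:
    - ReadWriteOnce                # AI Manager's default access mode
  resources:
    requests:
      storage: 200Gi               # size to match your training data sets
  storageClassName: portworx-aiops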
Portworx storage

Before you can use the service, you must create a Portworx storage class named portworx-shared-gp3. For more information, see Creating Portworx storage classes. To use Portworx storage:

  1. Follow the instructions for installing Portworx.
    Note: You must be a cluster administrator to install Portworx in the cluster.
  2. Run the following command to apply the specification that is generated by the Portworx spec generator:
    kubectl apply -f spec.yaml
  3. Define the storage class.
    Create a YAML file in which you define the storage class for the persistent volume.
    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: portworx-aiops
    provisioner: kubernetes.io/portworx-volume
    parameters:
       repl: "3"
       priority_io: "high"
       snap_interval: "0"
       io_profile: "db"
       block_size: "64k"
  4. Push the configuration change to the cluster by using the apply command with the following syntax:
    kubectl apply -f {pv-yaml-file-name} 
    where {pv-yaml-file-name} is the YAML file you created in the previous step.
  5. Configure AI Manager for IBM Cloud Pak for Data to use the storage. Override the persistent volume storage class settings in the values.yaml file. By default, they are set to use local-storage; specify portworx-aiops instead. This class sets the provisioner to kubernetes.io/portworx-volume. Specifically, you must override the following values (an expanded YAML sketch follows this list):
    • cos.minio.persistence.storageClass: portworx-aiops
    • etcd.config.dataPVC.storageClassName: portworx-aiops
    • postgres.config.persistence.storageClassName: portworx-aiops
    • mongodb.config.persistentVolume.storageClass: portworx-aiops
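    Expanded into nested form in values.yaml, these overrides might look like the following sketch (derived directly from the dotted keys in the preceding list):
    cos:
      minio:
        persistence:
          storageClass: portworx-aiops
    etcd:
      config:
        dataPVC:
          storageClassName: portworx-aiops
    postgres:
      config:
        persistence:
          storageClassName: portworx-aiops
    mongodb:
      config:
        persistentVolume:
          storageClass: portworx-aiops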
S3FS storage permissions

S3FS can mount a bucket as a directory while preserving the native object format for files, which allows the use of other tools. For more information about using S3FS with AI Manager, see the topic Training an Insight model for Watson AIOps AI Manager.
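For reference, mounting a bucket with the s3fs-fuse command line typically looks like the following sketch; the bucket name, mount point, credentials file, and endpoint URL are placeholders for your object storage configuration.
# Mount the bucket as a local directory (all values are placeholders)
s3fs my-training-bucket /mnt/training-data \
  -o passwd_file=${HOME}/.passwd-s3fs \
  -o url=https://s3.example.com \
  -o use_path_request_style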

Security reference

PodSecurityPolicy Requirements
Note: This information is provided for reference. The installation will create the security requirements for you.

This chart requires a PodSecurityPolicy to be bound to the target namespace prior to installation. To meet this requirement, cluster-scoped and namespace-scoped actions might need to occur before and after installation.

The predefined PodSecurityPolicy named ibm-restricted-psp has been verified for this chart; if your target namespace is bound to this PodSecurityPolicy, you can proceed to install the chart.

This chart also defines a custom PodSecurityPolicy that can be used to finely control the permissions and capabilities needed to deploy this chart. You can enable this custom PodSecurityPolicy by using the IBM Cloud Pak for Data user interface or the supplied instructions and scripts in the pak_extension preinstall directory. From the user interface, you can copy and paste the following code snippet to enable the custom PodSecurityPolicy:

Custom PodSecurityPolicy definition:

apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: ibm-watson-aiops-psp
spec:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: false
  allowedCapabilities:
    - CHOWN
    - DAC_OVERRIDE
    - SETGID
    - SETUID
    - NET_BIND_SERVICE
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - configMap
    - secret
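If you choose to apply the custom PodSecurityPolicy manually instead, a minimal sketch (with illustrative role and binding names and a placeholder namespace) is:
# Create the custom PodSecurityPolicy (assumes the definition above is saved as ibm-watson-aiops-psp.yaml)
kubectl apply -f ibm-watson-aiops-psp.yaml

# Allow service accounts in the target namespace to use it (role and binding names are illustrative)
kubectl create clusterrole ibm-watson-aiops-psp-user --verb=use \
  --resource=podsecuritypolicies --resource-name=ibm-watson-aiops-psp
kubectl -n <namespace> create rolebinding ibm-watson-aiops-psp-rb \
  --clusterrole=ibm-watson-aiops-psp-user --group=system:serviceaccounts:<namespace>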
Red Hat OpenShift SecurityContextConstraints Requirements
Note: This information is provided for reference. The installation will create the security requirements for you.

This chart requires a SecurityContextConstraints resource to be bound to the target namespace prior to installation. To meet this requirement, cluster-scoped and namespace-scoped actions might need to occur before and after installation.

The predefined SecurityContextConstraints named restricted has been verified for this chart; if your target namespace is bound to this SecurityContextConstraints resource, you can proceed to install the chart.

This chart also defines a custom SecurityContextConstraints resource that can be used to finely control the permissions and capabilities needed to deploy this chart. From the user interface, you can copy and paste the following code snippet to enable the custom SecurityContextConstraints:

Custom SecurityContextConstraints definition:

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: ibm-watson-aiops-scc
priority: null
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: null
defaultAddCapabilities: null
fsGroup:
  type: MustRunAs
groups:
- system:authenticated
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users:
- system:serviceaccount:aiops
volumes:
- configMap
- downwardAPI
- emptyDir
- hostPath
- persistentVolumeClaim
- projected
- secret
- flexVolume
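Similarly, if you apply the custom SecurityContextConstraints manually, a minimal sketch (with a placeholder namespace and service account) is:
# Create the custom SecurityContextConstraints (assumes the definition above is saved as ibm-watson-aiops-scc.yaml)
oc apply -f ibm-watson-aiops-scc.yaml

# Grant the SCC to a service account in the target namespace (names are placeholders)
oc adm policy add-scc-to-user ibm-watson-aiops-scc \
  system:serviceaccount:<namespace>:<service-account>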