Prerequisites and requirements

Before you install IBM Watson® AIOps AI Manager, you must meet the following requirements.

Prerequisites

Before you install AI Manager, you must install:

System requirements

When you choose the sizing that is appropriate for your implementation, note that the small instance is for trial purposes only. The small instance does not provide production-level performance and does not include high availability for the task manager (Apache Flink). High availability is available only for medium and large instances.

This bare-minimum setup does not provide production-like performance; however, it is sufficient for demonstration purposes.

Important: Take care to estimate your Elastic storage needs accurately in advance of installation. After installation, you cannot extend Elastic storage without potentially causing service interruptions in AI Manager and possible data loss. For more information about why Elastic storage is important, see Storage requirements.

Important: Intel CPUs earlier than the fourth-generation (Haswell) architecture do not support AVX2 (Haswell New Instructions). AVX2 is required for the Localization and blast radius service, which therefore does not work on pre-Haswell Intel CPUs.

Small instance

A small instance of AI Manager is the minimum supported size, and supports a single tenant with no high availability. High availability does not apply to training: if a job fails because of an internal error, the job restarts without any impact, but if a job manager or task manager pod fails, the job does not restart automatically. Data stores are set up with high availability by default. A small instance requires the following resources:

Component CPU Memory (GB)
AI Manager inference 11.8 26.00
AI Manager training 2.9 6.50
Strimzi Kafka 8.0 13.00
Elastic 9.0 9.00
Minio 1.0 8.00
PostgreSQL 0.8 2.00
Model training 0.6 2.25
Totals 34.1 66.75

Notes:

AI Manager inference consists of:

Component Pod Replica
Log Anomaly Pipeline ibm-aiops----ibm-flink-task-manager 1
Log Anomaly Pipeline ibm-aiops----aio-log-anomaly-detector 1
Event Grouping Pipeline ibm-aiops----aio-event-grouping 1
Event Grouping Pipeline ibm-aiops----aio-alert-localization 1
Event Grouping Pipeline ibm-aiops----aio-topology 1
ChatOps ibm-aiops----aio-chatops-orchestrator 1
ChatOps ibm-aiops----aio-chatops-slack-integrator 1
Incident Similarity Pipeline ibm-aiops----aio-similar-incidents-service 1
AI Manager Platform ibm-aiops----aio-controller 1
AI Manager Platform ibm-aiops----aio-flink-zookeeper 3
AI Manager Platform ibm-aiops----ibm-flink-job-manager 2
AI Manager Platform ibm-aiops----aio-mock-server 1
AI Manager Platform ibm-aiops----aio-persistence 1
AI Manager Platform ibm-aiops----aio-core-tests 1
AI Manager Platform ibm-aiops----aio-addon 1
AI Manager Platform ibm-aiops----aio-model-train-console 1
AI Manager Training 1
Elastic ibm-aiops---ib-d0d-* 3
Elastic ibm-aiops----elasticsearch-test-* 1
Minio ibm-aiops----minio-* 3
Minio ibm-aiops----minio-test-* 1
PostgreSQL ibm-aiops----postgres-keeper-* 3
PostgreSQL ibm-aiops----postgres-proxy-* 2
PostgreSQL ibm-aiops---op-55c0-sentinel-* 3
Model train ibm-aiops----ibm-dlaas-lcm-* 2
Model train ibm-aiops----ibm-dlaas-ratelimiter-* 1
Model train ibm-aiops----ibm-dlaas-trainer-v2-* 2
Model train ibm-object-storage-plugin 1
Model train ibmcloud-object-storage-driver 1 per node

A small instance supports the following volume:

Pipeline Component Number of items Size
Logs 4,000 msg/sec 454 Bytes/msg
Netcool Operations Insight® event groups 5 events/sec -
PagerDuty alerts 5 alerts/sec -
Incident queries 45 queries/min -
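As a rough sizing check (illustrative arithmetic only, not an official formula), the rated log throughput in the table above translates to the following raw daily volume:

```shell
# Raw daily log volume at the small-instance rating from the table above.
MSG_PER_SEC=4000        # 4,000 msg/sec
BYTES_PER_MSG=454       # 454 bytes/msg average
SECONDS_PER_DAY=86400

BYTES_PER_DAY=$((MSG_PER_SEC * BYTES_PER_MSG * SECONDS_PER_DAY))
GB_PER_DAY=$((BYTES_PER_DAY / 1000000000))
echo "Raw log volume: ${BYTES_PER_DAY} bytes/day (about ${GB_PER_DAY} GB/day)"
```

This is the raw message volume only; index and replica overhead in Elastic storage come on top of it.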

Medium instance

A medium deployment consists of approximately twice the resources of a small instance, which enables high availability or increased throughput. A medium instance requires the following resources:

Component CPU Memory (GB)
AI Manager inference 18.4 40.00
AI Manager training 2.9 6.50
Strimzi Kafka 8.0 13.00
Elastic 9.0 9.00
Minio 1.0 8.00
PostgreSQL 0.8 2.00
Model training 0.6 2.25
Totals 40.7 80.75

Notes:

AI Manager inference consists of:

Component Pod Replica
Log Anomaly Pipeline ibm-aiops----ibm-flink-task-manager 2
Log Anomaly Pipeline ibm-aiops----aio-log-anomaly-detector 2
Event Grouping Pipeline ibm-aiops----aio-event-grouping 2
Event Grouping Pipeline ibm-aiops----aio-alert-localization 2
Event Grouping Pipeline ibm-aiops----aio-topology 2
ChatOps ibm-aiops----aio-chatops-orchestrator 2
ChatOps ibm-aiops----aio-chatops-slack-integrator 2
Incident Similarity Pipeline ibm-aiops----aio-similar-incidents-service 2
AI Manager Platform ibm-aiops----aio-controller 2
AI Manager Platform ibm-aiops----aio-flink-zookeeper 3
AI Manager Platform ibm-aiops----ibm-flink-job-manager 2
AI Manager Platform ibm-aiops----aio-mock-server 1
AI Manager Platform ibm-aiops----aio-persistence 2
AI Manager Platform ibm-aiops----aio-core-tests 1
AI Manager Platform ibm-aiops----aio-addon 2
AI Manager Platform ibm-aiops----aio-model-train-console 1
AI Manager Training 1
Elastic ibm-aiops---ib-d0d-* 3
Elastic ibm-aiops----elasticsearch-test-* 1
Minio ibm-aiops----minio-* 3
Minio ibm-aiops----minio-test-* 1
PostgreSQL ibm-aiops----postgres-keeper-* 3
PostgreSQL ibm-aiops----postgres-proxy-* 2
PostgreSQL ibm-aiops---op-55c0-sentinel-* 3
Model train ibm-aiops----ibm-dlaas-lcm-* 2
Model train ibm-aiops----ibm-dlaas-ratelimiter-* 1
Model train ibm-aiops----ibm-dlaas-trainer-v2-* 2
Model train ibm-object-storage-plugin 1
Model train ibmcloud-object-storage-driver 1 per node

A medium instance supports the following volume:

Pipeline Component Number of items Size
Logs 8,000 msg/sec -
Netcool Operations Insight event groups 10 events/sec -
PagerDuty alerts 10 alerts/sec -
Incident queries 85 queries/min -

Scaling the services

The highest volume of data is processed by the Data Ingest tasks on log data (running on Apache Flink). The performance of these tasks depends heavily on how well the configuration is optimized for the data that is processed. You must take three pieces of information into account: the number of applications in the log data, the number of Kafka topic partitions, and the parallelism setting.

For best performance:

Number of apps (in log data) == Number of Kafka Topic Partitions == Number of Parallelism

For medium performance:

Number of apps (in log data) < Number of Kafka Topic Partitions == Number of Parallelism

And for the worst performance:

Number of apps (in log data) > Number of Kafka Topic Partitions < Number of Parallelism
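These conditions can be checked mechanically. The following sketch is illustrative; the helper function is hypothetical and not part of the product:

```shell
# Classify a log-ingest configuration according to the guidance above.
# Arguments: <apps> <kafka-partitions> <parallelism>  (hypothetical helper)
classify_ingest_config() {
  local apps=$1 partitions=$2 parallelism=$3
  if [ "$apps" -eq "$partitions" ] && [ "$partitions" -eq "$parallelism" ]; then
    echo "best"
  elif [ "$apps" -lt "$partitions" ] && [ "$partitions" -eq "$parallelism" ]; then
    echo "medium"
  else
    echo "review"   # any other combination risks the worst performance
  fi
}

classify_ingest_config 4 4 4   # apps == partitions == parallelism
```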

Each instance of Flink (task manager) has 12 slots, each with 256 MB of memory. Administrators can use the Parallelism configuration field to increase the number of instances that run in parallel. Each of these uses a slot; the number of slots that are required is the sum of all the parallelism configuration entries across all connections. Because each replica has 12 slots, if more slots are required, increase the number of replicas to match.

Note: Adding extra slots beyond the numbers that are specified in the parallel parameter will not increase performance because the slots remain idle.
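Because each task manager contributes 12 slots, the required replica count is the ceiling of the summed parallelism divided by 12. A small sketch with assumed example values:

```shell
# Compute the task-manager replicas needed for a set of connections.
# The parallelism values below are illustrative examples.
SLOTS_PER_TASK_MANAGER=12
PARALLELISM_ENTRIES=(8 6 4)   # one Parallelism setting per connection

total=0
for p in "${PARALLELISM_ENTRIES[@]}"; do
  total=$((total + p))
done

# Ceiling division: replicas = ceil(total / slots-per-task-manager)
replicas=$(( (total + SLOTS_PER_TASK_MANAGER - 1) / SLOTS_PER_TASK_MANAGER ))
echo "Slots required: ${total}; task-manager replicas needed: ${replicas}"
```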

To scale to a small or medium instance, create and run a scaling script:

  1. Run oc login
  2. Open a text editor, copy the shell script in the following section, and paste it into your editor.
  3. Save the script as scale.sh.
  4. Start the script as needed:

     ./scale.sh -i <instance-name> -c <config: "small" or "medium"> -n <namespace where aiops is installed>
    

scale.sh script

#!/bin/bash

set -e

USAGE="$(
  cat <<EOF
usage: $0
  [-i | --instance]         Instance name of the aiops installation
  [-n | --namespace]        Namespace where aiops instance is installed
  [-c | --config]           Config of AIOps; can be small or medium
EOF
)"

usage() {
  echo "$USAGE" >&2
  exit 1
}

CURR_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"

while (("$#")); do
  arg="$1"
  case $arg in
  -n | --namespace)
    if [ -n "$2" ] && [ "${2:0:1}" != "-" ]; then
      NAMESPACE=$2
      shift 2
    else
      echo "Error: Argument for $1 is missing" >&2
      exit 1
    fi
    ;;
  -i | --instance)
    if [ -n "$2" ] && [ "${2:0:1}" != "-" ]; then
      INSTANCE_NAME=$2
      shift 2
    else
      echo "Error: Argument for $1 is missing" >&2
      exit 1
    fi
    ;;
  -c | --config)
    if [ -n "$2" ] && [ "${2:0:1}" != "-" ]; then
      CONFIG=$2
      shift 2
    else
      echo "Error: Argument for $1 is missing" >&2
      exit 1
    fi
    ;;
  -h | --help)
    usage
    ;;
  -*) # unsupported flags
    echo "Error: Unsupported flag $1" >&2
    exit 1
    ;;
  *) # unsupported positional arguments
    echo "Error: Unsupported argument $1" >&2
    exit 1
    ;;
  esac
done

check_env_vars() {
    echo "========== Begin test for environment variables =========="

    if [ -z "${NAMESPACE}" ];
    then
        echo ">> NAMESPACE var not set... The -n or --namespace argument is required"
        exit 1
    fi

    if [ -z "${INSTANCE_NAME}" ];
    then
        echo ">>  INSTANCE_NAME var not set... The -i or --instance argument is required"
        exit 1
    fi

    if [ -z "${CONFIG}" ];
    then
        echo ">> CONFIG var not set... The -c or --config argument is required"
        exit 1
    fi
}

check_env_vars

RELEASE="ibm-aiops---${INSTANCE_NAME}"
component_label="component"

component_label_v2="app.kubernetes.io/component"

function scaleReplicas()
{
    echo "scaling $2 with label $3=$4 with replicas $1"
    oc scale --replicas="$1" "$2" -n "$NAMESPACE" -l "${3}=${4},app.kubernetes.io/instance=${RELEASE}"
}

function main()
{
    #ai4it apps
    addon_component_name="addon"
    alert_localization_component_name="alert-localization"
    chatops_orchestrator_component_name="chatops-orchestrator"
    chatops_slack_integrator_component_name="chatops-slack-integrator"
    controller_component_name="controller"
    event_grouping_component_name="event-grouping"
    log_anomaly_detector_component_name="log-anomaly-detector"
    mock_server_component_name="mock-server"
    model_train_console_component_name="model-train-console"
    persistence_component_name="persistence"
    similar_incidents_service_component_name="similar-incidents-service"
    topology_component_name="topology"
    qes_component_name="qes"

    #postgres Apps
    postgres_sentinel_component_name="stolon-sentinel"
    postgres_proxy_component_name="stolon-proxy"
    postgres_component_keeper="stolon-keeper"

    #dlaas
    lcm_component_name="lcm"
    ratelimiter_component_name="ratelimiter"
    trainer_v2_component_name="trainer-v2"

    #flink
    job_manager_component_name="job-manager"
    task_manager_component_name="task-manager"

    #gateway
    gw_component_component_name="gw-deployment"

    #minio
    minio_server_component_name="server"

    #elasticsearch
    elasticsearch_server_component_name="elasticsearch-server"


    if [ "$CONFIG" = "small" ]; then
        echo "small config set"
        #scale ai4it-icp chart replicas
        scaleReplicas 1 deployment $component_label_v2 $addon_component_name
        scaleReplicas 1 deployment $component_label_v2 $alert_localization_component_name
        scaleReplicas 1 deployment $component_label_v2 $chatops_orchestrator_component_name
        scaleReplicas 3 deployment $component_label_v2 $chatops_slack_integrator_component_name
        scaleReplicas 1 deployment $component_label_v2 $controller_component_name
        scaleReplicas 1 deployment $component_label_v2 $event_grouping_component_name
        scaleReplicas 1 deployment $component_label_v2 $log_anomaly_detector_component_name
        scaleReplicas 1 deployment $component_label_v2 $mock_server_component_name
        scaleReplicas 1 deployment $component_label_v2 $model_train_console_component_name
        scaleReplicas 1 deployment $component_label_v2 $persistence_component_name
        scaleReplicas 1 deployment $component_label_v2 $similar_incidents_service_component_name
        scaleReplicas 1 deployment $component_label_v2 $topology_component_name
        scaleReplicas 1 statefulset $component_label_v2 $qes_component_name


        #scale postgres subchart replicas
        scaleReplicas 2 deployment $component_label $postgres_proxy_component_name
        scaleReplicas 3 statefulset $component_label $postgres_component_keeper
        scaleReplicas 3 deployment $component_label $postgres_sentinel_component_name

        #scale dlaas subchart replicas
        scaleReplicas 2 deployment $component_label_v2 $lcm_component_name
        scaleReplicas 1 deployment $component_label_v2 $ratelimiter_component_name
        scaleReplicas 2 deployment $component_label_v2 $trainer_v2_component_name

        #scale flink subchart replicas
        scaleReplicas 1 deployment $component_label_v2 $job_manager_component_name
        scaleReplicas 1 deployment $component_label_v2 $task_manager_component_name

        #scale gateway subchart replicas
        scaleReplicas 1 deployment $component_label_v2 $gw_component_component_name

        #scale minio subchart replicas
        scaleReplicas 4 statefulset $component_label_v2 $minio_server_component_name

        #scale elasticsearch subchart replicas
        scaleReplicas 1 statefulset $component_label_v2 $elasticsearch_server_component_name

    fi

    if [ "$CONFIG" = "medium" ]; then
        echo "medium config set"
        #scale ai4it-icp chart replicas
        scaleReplicas 1 deployment $component_label_v2 $addon_component_name
        scaleReplicas 2 deployment $component_label_v2 $alert_localization_component_name
        scaleReplicas 2 deployment $component_label_v2 $chatops_orchestrator_component_name
        scaleReplicas 6 deployment $component_label_v2 $chatops_slack_integrator_component_name
        scaleReplicas 1 deployment $component_label_v2 $controller_component_name
        scaleReplicas 2 deployment $component_label_v2 $event_grouping_component_name
        scaleReplicas 2 deployment $component_label_v2 $log_anomaly_detector_component_name
        scaleReplicas 1 deployment $component_label_v2 $mock_server_component_name
        scaleReplicas 1 deployment $component_label_v2 $model_train_console_component_name
        scaleReplicas 2 deployment $component_label_v2 $persistence_component_name
        scaleReplicas 2 deployment $component_label_v2 $similar_incidents_service_component_name
        scaleReplicas 2 deployment $component_label_v2 $topology_component_name
        scaleReplicas 1 statefulset $component_label_v2 $qes_component_name


        #scale postgres subchart replicas
        scaleReplicas 2 deployment $component_label $postgres_proxy_component_name
        scaleReplicas 3 statefulset $component_label $postgres_component_keeper
        scaleReplicas 3 deployment $component_label $postgres_sentinel_component_name

        #scale dlaas subchart replicas
        scaleReplicas 2 deployment $component_label_v2 $lcm_component_name
        scaleReplicas 1 deployment $component_label_v2 $ratelimiter_component_name
        scaleReplicas 2 deployment $component_label_v2 $trainer_v2_component_name

        #scale flink subchart replicas
        scaleReplicas 1 deployment $component_label_v2 $job_manager_component_name
        scaleReplicas 2 deployment $component_label_v2 $task_manager_component_name

        #scale gateway subchart replicas
        scaleReplicas 1 deployment $component_label_v2 $gw_component_component_name

        #scale minio subchart replicas
        scaleReplicas 4 statefulset $component_label_v2 $minio_server_component_name

        #scale elasticsearch subchart replicas
        scaleReplicas 1 statefulset $component_label_v2 $elasticsearch_server_component_name
    fi

}

main

Storage requirements

The service uses data stores and other resources to support its functions. These resources reside in Portworx storage. The following table lists the storage resources that are required to support the persistent volumes that the service uses.

Storage resource Description
Local Local storage (not host-local storage) is storage that is allocated to the container while it runs. There is generally not much of it, and it is temporary: when the container terminates, so does this storage. AI Manager uses a few MB for temporary files.
Persistent Volume Claims (PVC) A PVC can mount several types of storage, such as Portworx, into a container. AI Manager uses PVCs during training and quality evaluations. In both cases, the amount that is used depends on the size of the training and validation data sets. During training, the PVC holds a copy of the decompressed training data set: several TB for logs, typically around 100 MB for events or alerts, and tens of MB for incidents and entity mappings. For quality evaluation, a copy of the evaluation data set is kept in the PVC; the size depends on the data set that is used, but is typically around 1 GB. Evaluate the data sets that you plan to use, and allocate the appropriate amount of storage.
PostgreSQL PostgreSQL stores the operational data for AI Manager, such as stories and anomalies. The data for each story (including all associated data) is small, only a few hundred KB; however, this grows as each story is detected. Include enough storage for the anticipated volume of stories combined with the retention period.
Elastic Elasticsearch stores the training artifacts for both log anomalies and incidents. Both are fairly small, typically tens of MB for each version. These grow with each retraining, so although not much storage is required, consider the number of versions that you keep.
MinIO MinIO is used for long-term file storage, and for temporary storage between training jobs. It is mostly used for training, but unlike the PVC, it is long-term storage. The amount that is used depends heavily on the size of the training data sets. For logs, this consists of 1x the compressed training data set (typically several hundred GB) plus 0.5x the uncompressed training data set (hundreds of GB or even a few TB). Event training uses 2x the event data set (typically a few hundred MB), and the remaining uses are negligible (several MB). Each retraining increases the amount that is used by 0.5x the uncompressed log training data set plus 1x the event data set.
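Applying these multipliers, a back-of-the-envelope MinIO estimate for log training can be sketched as follows. The data-set sizes and retraining count are assumptions, not product figures:

```shell
# Estimate MinIO usage for log training: 1x compressed + 0.5x uncompressed
# initially, plus 0.5x uncompressed per retraining (event data sets are
# comparatively small and are ignored here).
COMPRESSED_GB=300        # assumed compressed log training data set
UNCOMPRESSED_GB=1200     # assumed uncompressed size
RETRAININGS=3            # assumed number of retrainings to plan for

initial=$((COMPRESSED_GB + UNCOMPRESSED_GB / 2))
growth=$((RETRAININGS * UNCOMPRESSED_GB / 2))
echo "Initial: ${initial} GB; after ${RETRAININGS} retrainings: $((initial + growth)) GB"
```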

Given these details, storage requirements depend primarily on the size of your training data sets.

Important: AI Manager uses ReadWriteOnce for storage, except in cases where storage is required to be shared by multiple pods, in which case S3FS is used to provide ReadWriteMany.

Portworx storage

Before you can use the service, you must create a Portworx storage class named portworx-shared-gp3. For more information, see Creating Portworx storage classes. To use Portworx storage:

  1. Follow the instructions for installing Portworx.

    Note: You must be a cluster administrator to install Portworx in the cluster.

  2. Run the following command to apply the specification that is generated by the Portworx spec generator:

     kubectl apply -f spec.yaml
    
  3. Define the storage class.

    Create a YAML file in which you define the storage class for the persistent volume.

     kind: StorageClass
     apiVersion: storage.k8s.io/v1
     metadata:
       name: portworx-aiops
     provisioner: kubernetes.io/portworx-volume
     parameters:
        repl: "3"
        priority_io: "high"
        snap_interval: "0"
        io_profile: "db"
        block_size: "64k"
    
  4. Push the configuration change to the cluster by using the kubectl apply command with the following syntax:

     kubectl apply -f <pv-yaml-file-name>
    

    where <pv-yaml-file-name> is the YAML file that you created in the previous step.
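For reference, a persistent volume claim that consumes this portworx-aiops class might look like the following. This is illustrative only; the claim name and size are hypothetical, and the chart normally creates its own claims:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-aiops-pvc        # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce              # AI Manager uses ReadWriteOnce except for shared storage
  storageClassName: portworx-aiops
  resources:
    requests:
      storage: 10Gi              # size according to your training data sets
```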

  5. Configure AI Manager to use the storage. Override the persistent volume storage class setting in the values.yaml file. By default, it is set to use local-storage. You can specify portworx-aiops instead. This class sets the provisioner to kubernetes.io/portworx-volume. Specifically, you must override the following values:

S3FS storage permissions

S3FS can mount a bucket as a directory while preserving the native object format for files, which allows other tools to work with the same data. For more information about using S3FS with AI Manager, see Training an Insight model for Watson AIOps AI Manager.

Security reference

PodSecurityPolicy Requirements

Note: This information is provided for reference. The installation creates the security requirements for you.

This chart requires a PodSecurityPolicy to be bound to the target namespace before installation. To meet this requirement, there might be cluster-scoped and namespace-scoped pre-actions and post-actions that must occur.

The predefined PodSecurityPolicy named ibm-restricted-psp has been verified for this chart; if your target namespace is bound to this PodSecurityPolicy, you can proceed to install the chart.

This chart also defines a custom PodSecurityPolicy which can be used to finely control the permissions and capabilities that are needed to deploy this chart. You can enable this custom PodSecurityPolicy by using the IBM Cloud Pak for Data user interface or the supplied instructions and scripts in the pak_extension preinstall directory. From the user interface, you can copy and paste the following code snippets to enable the custom PodSecurityPolicy:

Custom PodSecurityPolicy definition:

apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: ibm-watson-aiops-psp
spec:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: false
  allowedCapabilities:
      - CHOWN
      - DAC_OVERRIDE
      - SETGID
      - SETUID
      - NET_BIND_SERVICE
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - configMap
  - secret

Red Hat OpenShift SecurityContextConstraints Requirements

Note: This information is provided for reference. The installation creates the security requirements for you.

This chart requires a SecurityContextConstraints to be bound to the target namespace before installation. To meet this requirement, there might be cluster-scoped and namespace-scoped pre-actions and post-actions that must occur.

The predefined SecurityContextConstraints resource named restricted has been verified for this chart; if your target namespace is bound to this SecurityContextConstraints resource, you can proceed to install the chart.

This chart also defines a custom SecurityContextConstraints that can be used to finely control the permissions or capabilities needed to deploy this chart. From the user interface, you can copy and paste the following code snippets to enable the custom SecurityContextConstraints:

Custom SecurityContextConstraints definition:

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: ibm-watson-aiops-scc
priority: null
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: null
defaultAddCapabilities: null
fsGroup:
  type: MustRunAs
groups:
- system:authenticated
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users:
- system:serviceaccount:aiops
volumes:
- configMap
- downwardAPI
- emptyDir
- hostPath
- persistentVolumeClaim
- projected
- secret
- flexVolume