Online upgrade of IBM Cloud Pak for AIOps on Linux

Use these instructions to upgrade an online deployment of IBM Cloud Pak® for AIOps 4.7.0 or later to 4.8.0.

Overview

This procedure is for deployments of IBM Cloud Pak for AIOps on Linux®. This procedure can be used on an online deployment of IBM Cloud Pak for AIOps 4.7.0 or later, and can still be used if the deployment has hotfixes applied.

Upgrading deployments of IBM Cloud Pak for AIOps on Linux that are older than 4.7.0 (that is, 4.6.1 or earlier) is not supported, so you cannot use these instructions for them. If you have a deployment older than 4.7.0, you must uninstall it and then install 4.8.0.

If you have an offline deployment of IBM Cloud Pak for AIOps on Linux, follow the instructions in Offline upgrade of IBM Cloud Pak for AIOps on Linux.

If you have a deployment of IBM Cloud Pak for AIOps on Red Hat® OpenShift® Container Platform, follow the instructions in Upgrading IBM Cloud Pak for AIOps on OpenShift.

Before you begin

Ensure that you meet the following prerequisites:

  • The worker nodes and the client machine that you are running the upgrade from have network connectivity to the control plane nodes.
  • You have the credentials for the root user. The root user must be used to upgrade IBM Cloud Pak for AIOps.

Warnings:

  • Custom patches, labels, and manual adjustments to IBM Cloud Pak for AIOps resources are lost when IBM Cloud Pak for AIOps is upgraded, and must be manually reapplied after upgrade. For more information, see Manual adjustments are not persisted.
  • The upgrade cannot be removed or rolled back.

Upgrade procedure

Follow these steps to upgrade your online IBM Cloud Pak for AIOps deployment.

  1. Ensure cluster readiness
  2. Update the aiopsctl tool on the cluster nodes
  3. Upgrade IBM Cloud Pak for AIOps
  4. Verify your deployment
  5. Post upgrade actions

1. Ensure cluster readiness

Recommended: Take a backup of IBM Cloud Pak for AIOps before you upgrade to v4.8.0. For more information, see Back up and restore (IBM Cloud Pak for AIOps on Linux).

  1. Ensure that your cluster still meets all the prerequisites for deployment. For more information, see Planning an installation of IBM Cloud Pak for AIOps on Linux.

  2. Set environment variables.

    1. Set an environment variable for your deployment mode.

      From IBM Cloud Pak for AIOps 4.8.0, deployment mode is configured with a command line argument instead of with a configuration file.

      Edit the aiops_var.sh file that you created when you installed IBM Cloud Pak for AIOps, and update the DEPLOY_TYPE environment variable to the required value. Set it to extended if you have an extended deployment with log anomaly detection and ticket analysis capabilities, or to base if you have a base deployment without these capabilities.

    2. Run the following command from the directory that the script is in, to set the environment variables that are used later.

      . ./aiops_var.sh
      

    Note: If you no longer have the aiops_var.sh file from when you installed IBM Cloud Pak for AIOps, then follow the instructions in Create environment variables.

  3. Delete any evicted connector-orchestrator pods.

    1. Run the following command to check whether there are any evicted connector-orchestrator pods.

      ssh ${TARGET_USER}@${CONTROL_PLANE_NODE} oc get pods -n aiops | grep connector-orchestrator
      
    2. Clean up any evicted connector-orchestrator pods.

      If the previous command returned any pods with a STATUS of Evicted, then run the following command to delete each of them.

      ssh ${TARGET_USER}@${CONTROL_PLANE_NODE} oc delete pod -n aiops <connector_orchestrator>
      

      Where <connector_orchestrator> is a pod returned by the previous step.
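
When several pods are evicted, the check and the delete in step 3 can be combined. The following is a minimal sketch, not part of the product tooling. It assumes the default `oc get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE), and the function name evicted_pods is hypothetical.

```shell
# Sketch: read `oc get pods` output on stdin and print the names of
# connector-orchestrator pods whose STATUS column (field 3) is Evicted.
evicted_pods() {
  awk '$1 ~ /connector-orchestrator/ && $3 == "Evicted" { print $1 }'
}

# Hypothetical usage (commented out so that nothing is deleted by accident).
# `ssh -n` stops ssh from consuming the pod list that the loop reads on stdin.
#   ssh "${TARGET_USER}@${CONTROL_PLANE_NODE}" oc get pods -n aiops | evicted_pods |
#   while read -r pod; do
#     ssh -n "${TARGET_USER}@${CONTROL_PLANE_NODE}" oc delete pod -n aiops "$pod"
#   done
```

Pods in any other state, such as Running, are not printed and are therefore left untouched.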

2. Update the aiopsctl tool on the cluster nodes

Run the following commands to update the aiopsctl tool on all the cluster nodes.

Note: When you run the command that updates the first control plane node, you are prompted to accept the upgrade. The commands for the subsequent nodes use the -y flag to bypass this prompt.

ssh ${TARGET_USER}@${CONTROL_PLANE_NODE} aiopsctl update --version v4.8.0

echo "Upgrading main control plane node ${CONTROL_PLANE_NODE}"
ssh ${TARGET_USER}@${CONTROL_PLANE_NODE} aiopsctl cluster node up --accept-license=${ACCEPT_LICENSE} --role=control-plane

echo "Upgrading additional control plane nodes"
for CP_NODE in "${ADDITIONAL_CONTROL_PLANE_NODES[@]}"; do
  ssh ${TARGET_USER}@${CONTROL_PLANE_NODE} ssh ${TARGET_USER}@${CP_NODE} aiopsctl update -y --version v4.8.0
  ssh ${TARGET_USER}@${CONTROL_PLANE_NODE} ssh ${TARGET_USER}@${CP_NODE} aiopsctl cluster node up -y --accept-license=${ACCEPT_LICENSE} --role=control-plane
done

echo "Upgrading worker nodes"
for WORKER_NODE in "${WORKER_NODES[@]}"; do
  ssh ${TARGET_USER}@${CONTROL_PLANE_NODE} ssh ${TARGET_USER}@${WORKER_NODE} aiopsctl update -y --version v4.8.0
  ssh ${TARGET_USER}@${CONTROL_PLANE_NODE} ssh ${TARGET_USER}@${WORKER_NODE} aiopsctl cluster node up -y --accept-license=${ACCEPT_LICENSE} --role=worker
done
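
The commands above depend on variables that are set by aiops_var.sh. If WORKER_NODES is empty, the worker loop runs zero times without reporting an error, so it can be worth checking the variables before you start. The following bash sketch is one way to do that. The function name missing_vars is hypothetical, and the sketch assumes that WORKER_NODES is a bash array, as the loops above suggest.

```shell
# Sketch: print the name of each variable used by this procedure that is
# unset or empty, and return non-zero if any name was printed.
missing_vars() {
  local name missing=0
  for name in TARGET_USER CONTROL_PLANE_NODE ACCEPT_LICENSE LOAD_BALANCER_HOST DEPLOY_TYPE; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "$name"
      missing=1
    fi
  done
  # An empty or unset WORKER_NODES array makes the worker loop a silent no-op.
  if [ "${#WORKER_NODES[@]}" -eq 0 ]; then
    echo "WORKER_NODES"
    missing=1
  fi
  return "$missing"
}

# Hypothetical usage, after sourcing aiops_var.sh and before the update commands:
#   missing_vars || echo "Set the variables listed above before you continue."
```

ADDITIONAL_CONTROL_PLANE_NODES is deliberately not checked, on the assumption that a deployment with a single control plane node can leave it empty.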

3. Upgrade IBM Cloud Pak for AIOps

Run the following command to upgrade IBM Cloud Pak for AIOps to v4.8.0.

ssh ${TARGET_USER}@${CONTROL_PLANE_NODE} aiopsctl server up --load-balancer-host="${LOAD_BALANCER_HOST}" --mode "${DEPLOY_TYPE}"

Note: You are prompted to confirm that you want to accept the upgrade.

4. Verify your deployment

The upgrade takes around an hour to complete. If the upgrade is unsuccessful, an error message is displayed and a nonzero exit code is returned.

  1. Check the status of your deployment.

    Run the following command to check the status of the components of your IBM Cloud Pak for AIOps installation:

    aiopsctl status
    

    Example output for a healthy deployment:

    $ aiopsctl status
    o- [12 Aug 24 08:40 PDT] Getting cluster status
    Control Plane Node(s):
        test-server-1.acme.com Ready
        test-server-2.acme.com Ready
        test-server-3.acme.com Ready
    
    Worker Node(s):
        test-agent-1.acme.com Ready
        test-agent-2.acme.com Ready
        test-agent-3.acme.com Ready
        test-agent-4.acme.com Ready
        test-agent-5.acme.com Ready
        test-agent-6.acme.com Ready
        test-agent-7.acme.com Ready
    
    o- [12 Aug 24 08:40 PDT] Checking AIOps installation status
    
      15 Ready Components
        cluster
        aimanager
        lifecycletrigger
        aiopsanalyticsorchestrator
        baseui
        elasticsearchcluster
        aiopsedge
        rediscp
        asm
        zenservice
        aiopsui
        commonservice
        kafka
        issueresolutioncore
        lifecycleservice
    
      AIOps installation healthy
    

  2. Run the following command and check that the VERSION that is returned is 4.8.0.

    aiopsctl server version
    

If the upgrade fails, or is not complete and is not progressing, then see Troubleshooting installation and upgrade and Known Issues to help you identify any installation problems.
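
Because the upgrade takes around an hour, you might prefer to poll the status instead of rerunning the command by hand. The following is a minimal sketch that relies only on the summary line shown in the example output above. The function name installation_healthy and the polling interval are assumptions.

```shell
# Sketch: read `aiopsctl status` output on stdin and succeed only if it
# contains the summary line "AIOps installation healthy" on a line of its own.
installation_healthy() {
  grep -Eq '^[[:space:]]*AIOps installation healthy[[:space:]]*$'
}

# Hypothetical usage on a control plane node: check once a minute for up to
# 90 minutes, and stop as soon as the deployment reports healthy.
#   for _ in $(seq 1 90); do
#     aiopsctl status | installation_healthy && break
#     sleep 60
#   done
```

A healthy summary line does not replace the version check in step 2, which confirms that the running version is 4.8.0.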

5. Post upgrade actions

  1. If you previously took a backup of your deployment, it is recommended that you take a new backup. For more information, see Back up and restore (IBM Cloud Pak for AIOps on Linux).

  2. If the EXPIRY_SECONDS environment variable was set for configuring log anomaly alerts, it is not retained during the upgrade. After the upgrade is completed, set the environment variable again. For more information about setting the variable, see Configuring expiry time for log anomaly alerts.

  3. If you have a metric integration configured that stops working after upgrade, then you must follow the instructions in After upgrade, a metric integration goes into a failed state.

  4. (Optional) You can use the following steps to remove unnecessary data from your Cloud Pak for AIOps environment:

    Note: Use the following steps if high availability (HA) is enabled for your Cloud Pak for AIOps deployment.

    Run the following commands from the control plane node.

    1. Switch to the project (namespace) where Cloud Pak for AIOps is deployed.

      oc project aiops
      
    2. Verify the health of your Cloud Pak for AIOps deployment:

      oc get installation -o go-template='{{ $i:=index .items 0 }}{{ range $c,$s := $i.status.componentstatus }}{{ $c }}{{ ": " }}{{ $s }}{{ "\n" }}{{ end }}'
      

      All the components need to be in Ready status.

    3. Delete the zookeeper data by running the following four commands:

      oc exec iaf-system-zookeeper-0 -- /opt/kafka/bin/zookeeper-shell.sh 127.0.0.1:12181 deleteall /flink/aiops/ir-lifecycle
      
      oc exec iaf-system-zookeeper-0 -- /opt/kafka/bin/zookeeper-shell.sh 127.0.0.1:12181 deleteall /flink/aiops/ir-lifecycle2
      
      oc exec iaf-system-zookeeper-0 -- /opt/kafka/bin/zookeeper-shell.sh 127.0.0.1:12181 deleteall /flink/aiops/ir-lifecycle3
      
      oc exec iaf-system-zookeeper-0 -- /opt/kafka/bin/zookeeper-shell.sh 127.0.0.1:12181 deleteall /flink/aiops/cp4waiops-eventprocessor
      
    4. Delete the Issue Resolution (IR) lifecycle metadata by running the following three commands:

      img=$(oc get csv -o jsonpath='{.items[?(@.spec.displayName=="IBM AIOps AI Manager")].spec.install.spec.deployments[?(@.name=="aimanager-operator-controller-manager")].spec.template.metadata.annotations.olm\.relatedImage\.opencontent-minio-client}')
      
      minio=$(oc get flinkdeployment aiops-ir-lifecycle-flink -o jsonpath='{.spec.flinkConfiguration.s3\.endpoint}')
      
      oc delete job --ignore-not-found aiops-clean-s3
      cat <<EOF | oc apply --validate -f -
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: aiops-clean-s3
      spec:
        backoffLimit: 6
        parallelism: 1
        template:
         metadata:
           labels:
             component: aiops-clean-s3
           name: clean-s3
         spec:
           affinity:
             nodeAffinity:
               requiredDuringSchedulingIgnoredDuringExecution:
                 nodeSelectorTerms:
                 - matchExpressions:
                   - key: kubernetes.io/arch
                     operator: In
                     values:
                     - amd64
           containers:
           - command:
             - /bin/bash
             - -c
             - |-
               echo "Connecting to Minio server: $minio"
               try=0
               while true; do
                 mc alias set aiopss3 $minio \$(cat /config/accesskey) \$(cat /config/secretkey)
                 if [ \$? -eq 0 ]; then break; fi
                 try=\$(expr \$try + 1)
                 if [ \$try -ge 30 ]; then exit 1; fi
                 sleep 2
               done
               /workdir/bin/mc rm -r --force aiopss3/aiops-ir-lifecycle/high-availability/ir-lifecycle
               x=\$?
               /workdir/bin/mc ls aiopss3/aiops-ir-lifecycle/high-availability
               exit \$x
             image: $img
             imagePullPolicy: IfNotPresent
             name: clean-s3
             resources:
               limits:
                 cpu: 500m
                 memory: 512Mi
               requests:
                 cpu: 200m
                 memory: 256Mi
             securityContext:
               allowPrivilegeEscalation: false
               capabilities:
                 drop:
                 - ALL
               privileged: false
               readOnlyRootFilesystem: false
               runAsNonRoot: true
             volumeMounts:
             - name: s3-credentials
               mountPath: /config
             - name: s3-ca
               mountPath: /workdir/home/.mc/certs/CAs
           volumes:
           - name: s3-credentials
             secret:
               secretName: aimanager-ibm-minio-access-secret
           - name: s3-ca
             secret:
               items:
               - key: ca.crt
                 path: ca.crt
               secretName: aimanager-certificate-secret
           restartPolicy: Never
           serviceAccount: aimanager-workload-admin
           serviceAccountName: aimanager-workload-admin
      EOF
      
    5. Check the status of the job:

      oc get po -l component=aiops-clean-s3
      

      Verify that the status shows as Completed.
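
The job is created with backoffLimit: 6, so pods from failed attempts can be listed alongside the pod that eventually succeeds. The following is a minimal sketch that treats the cleanup as finished when at least one pod shows Completed. It assumes the default `oc get po` column layout, and the function name job_pod_completed is hypothetical.

```shell
# Sketch: read `oc get po` output on stdin and succeed if at least one row
# has Completed in the STATUS column (field 3).
job_pod_completed() {
  awk '$3 == "Completed" { found = 1 } END { exit (found ? 0 : 1) }'
}

# Hypothetical usage from the control plane node:
#   oc get po -l component=aiops-clean-s3 | job_pod_completed && echo "Cleanup job finished"
#
# An alternative, not used in the steps above, is to wait on the job itself:
#   oc wait --for=condition=complete job/aiops-clean-s3 --timeout=300s
```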