Restarting the environment (IBM Cloud Pak for AIOps on OpenShift)

Learn how to shut down and restart the Red Hat OpenShift cluster where IBM Cloud Pak for AIOps is deployed.

Overview

Use this procedure before a known maintenance window or outage to shut down the Red Hat OpenShift cluster where IBM Cloud Pak for AIOps is installed, and to restart the cluster and workloads afterward.

Warning: If you need to shut down the cluster where IBM Cloud Pak for AIOps is installed, then you must use the following procedure. Failure to do so can result in data loss or corruption.

Procedure

  1. Validate the installation
  2. Check the certificates
  3. Prepare to scale down
  4. Scale down the workloads and drain the nodes
  5. Shut down the cluster
  6. Restart the cluster
  7. Scale up the workloads
  8. Validate the installation

1. Validate the installation

  1. Set environment variables.

    Make a note of these environment variables or save them to a file, as you will need to export them again after you restart your cluster.

    export AIOPS_NAMESPACE=<project>
    export AIOPS_INSTANCE=$(oc get installation -o jsonpath='{.items[0].metadata.name}' -n ${AIOPS_NAMESPACE})
    

    Where <project> is the namespace (project) that your IBM Cloud Pak for AIOps installation is deployed in.
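
    For example, if your installation is deployed in a project named cp4aiops (a hypothetical name):

    export AIOPS_NAMESPACE=cp4aiops
    export AIOPS_INSTANCE=$(oc get installation -o jsonpath='{.items[0].metadata.name}' -n ${AIOPS_NAMESPACE})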

  2. Run the describe command:

    oc describe installations.orchestrator.aiops.ibm.com -n "${AIOPS_NAMESPACE}"
    

    Review the ComponentStatus fields to confirm that all components are marked as Ready and the phase is Running.

    Example output:

    Name:         ibm-cp-aiops
    Namespace:    aiops
    API Version:  orchestrator.aiops.ibm.com/v1alpha1
    Kind:         Installation
    Spec:
    ...
    Status:
    Componentstatus:
       Aimanager:                       Ready
       Aiopsanalyticsorchestrator:      Ready
       Aiopsedge:                       Ready
       Aiopsui:                         Ready
       Asm:                             Ready
       Baseui:                          Ready
       Cluster:                         Ready
       Commonservice:                   Ready
       Elasticsearchcluster:            Ready
       Flinkdeployment:                 Ready
       Issueresolutioncore:             Ready
       Kafka:                           Ready
       Lifecycleservice:                Ready
       Lifecycletrigger:                Ready
       Rediscp:                         Ready
       Tunnel:                          Ready
       Zenservice:                      Ready
    Phase:                   Running
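
    For a quick scripted check, the phase can also be read with a jsonpath query. A minimal sketch, assuming that the Phase field in the describe output corresponds to the .status.phase path:

    # Print the overall installation phase; expect "Running".
    oc get installation "${AIOPS_INSTANCE}" -n "${AIOPS_NAMESPACE}" -o jsonpath='{.status.phase}{"\n"}'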
    

2. Check the certificates

Ensure that none of the certificates have problems or are expired.

Run the following command:

#!/bin/bash
# For each TLS secret in the cluster, print its namespace, name, and the expiry
# date of its certificate. The EXPIRY column is a placeholder that the final
# sed command replaces with the real date.
while read l; do
    # Print the header line unchanged.
    echo "$l" | grep '^NAME' || (
        n=$(echo $l | sed 's/ .*//')
        s=$(echo $l | sed 's/^[^ ]* *\([^ ]*\).*/\1/')
        x=$(oc get secret -n $n $s -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate 2>/dev/null | sed 's!notAfter=!!')
        echo "$l" | sed 's![^ ][^ ]*$!'"$x"'!'
    )
done < <(oc get secret -A --field-selector=type==kubernetes.io/tls -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,EXPIRY:.metadata.name)

Example output excerpt:

ibm-licensing   ibm-license-service-cert                                       Jan  8  13:32:07  2025  GMT
ibm-licensing   ibm-license-service-cert-internal                              Jan  7  13:31:12  2026  GMT
ibm-licensing   ibm-licensing-service-prometheus-cert                          Jan  7  13:31:25  2026  GMT
cp4aiops        aimanager-aio-log-anomaly-feedback-learning-cert               Jan  7  14:01:43  2026  GMT
cp4aiops        aimanager-aio-log-anomaly-golden-signals-cert                  Jan  7  14:01:43  2026  GMT
cp4aiops        aimanager-aio-oob-recommended-actions-cert                     Jan  7  14:01:43  2026  GMT
<...>

Renew or re-create any certificates that have problems, are expired, or will expire before the cluster is restarted. For more information about certificate management, see Renew or re-create certificates in OpenShift 4.x.
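
To flag only the certificates that expire soon, a variation of the preceding script can use openssl x509 -checkend, which exits non-zero when a certificate expires within the given number of seconds. A minimal sketch, assuming a 14-day window is wanted; it also flags any secret whose certificate cannot be parsed:

# Report TLS secrets whose certificate expires within 14 days (1209600 seconds).
oc get secret -A --field-selector=type==kubernetes.io/tls \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers |
while read -r namespace secret; do
    oc get secret -n "$namespace" "$secret" -o jsonpath='{.data.tls\.crt}' \
        | base64 -d | openssl x509 -noout -checkend 1209600 >/dev/null 2>&1 \
        || echo "EXPIRING SOON: $namespace/$secret"
done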

3. Prepare to scale down

  1. Cordon the worker nodes.

    Run the following command for each of the worker nodes:

    oc adm cordon <node>
    

    Where <node> is the name of the node to cordon.
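
    To cordon every worker node in one pass, you can loop over the nodes. A sketch, assuming that your worker nodes carry the standard node-role.kubernetes.io/worker label:

    for node in $(oc get nodes -l node-role.kubernetes.io/worker= -o name); do
        oc adm cordon "$node"
    done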

  2. Make a note of the number of replicas.

    1. Make a note of the number of replicas for each StatefulSet.

      oc get statefulsets -n ${AIOPS_NAMESPACE}
      

      Example output:

      NAME                                               READY   AGE
      aimanager-ibm-minio                                1/1     18m
      aiops-ir-analytics-spark-worker                    2/2     33m
      aiops-ir-core-ncobackup                            1/1     37m
      aiops-ir-core-ncoprimary                           1/1     39m
      aiops-topology-cassandra                           1/1     43m
      c-example-couchdbcluster-m                         1/1     40m
      aiops-ibm-elasticsearch-es-server-all              1/1     49m
      ibm-cp-aiops-redis-server                          3/3     45m
      zen-minio                                          3/3     40m
      

      Note:

      • If you do not have an IBM® Netcool® Operations Insight® probe integration, then aiops-ir-core-ncobackup and aiops-ir-core-ncoprimary have zero replicas.
      • If you upgraded from an earlier version of IBM Cloud Pak for AIOps, you also have an icp-mongodb StatefulSet.
    2. Make a note of the number of replicas for each StrimziPodSet.

      oc get strimzipodset -n ${AIOPS_NAMESPACE}
      

      Example output:

      NAME                                  PODS   READY PODS   CURRENT PODS   AGE
      iaf-system-kafka                      3      3            3              13d
      iaf-system-zookeeper                  3      3            3              13d
      

    3. Make a note of the number of replicas for each Flink deployment by using the following command:

      oc get deployment -n ${AIOPS_NAMESPACE} | grep flink | grep -v "operator"
      

      Example output:

      NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
      aiops-ir-lifecycle-flink                          1/1     1            1             137m
      aiops-ir-lifecycle-flink-taskmanager              1/1     1            1             137m
      aiops-lad-flink                                   1/1     1            1             139m
      aiops-lad-flink-taskmanager                       2/2     2            2             139m
      

      Note: If you have a base deployment, then aiops-lad-flink and aiops-lad-flink-taskmanager do not show in the preceding output.
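
    Optionally, save all of these replica counts to a file so that they are still available after the restart. A minimal sketch that reuses the preceding commands (the file name replica-counts.txt is only an example):

    {
        oc get statefulsets -n ${AIOPS_NAMESPACE}
        oc get strimzipodset -n ${AIOPS_NAMESPACE}
        oc get deployment -n ${AIOPS_NAMESPACE} | grep flink | grep -v "operator"
    } > replica-counts.txt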

4. Scale down the workloads and drain the nodes

  1. Scale down the operator deployments in the IBM Cloud Pak for AIOps namespace.

    oc scale deployment -l olm.owner.kind=ClusterServiceVersion -n ${AIOPS_NAMESPACE} --replicas=0
    

    Run the following command to check that the number of replicas for each of the operator deployments is now 0.

    oc get deployment -n ${AIOPS_NAMESPACE} -l olm.owner.kind=ClusterServiceVersion
    

    Example output:

    NAME                                                  READY   UP-TO-DATE   AVAILABLE   AGE
    aimanager-operator-controller-manager                 0/0     0            0           47m
    aiopsedge-operator-controller-manager                 0/0     0            0           47m
    asm-operator                                          0/0     0            0           47m
    flink-kubernetes-operator                             0/0     0            0           54m
    ibm-aiops-orchestrator-controller-manager             0/0     0            0           58m
    ibm-common-service-operator                           0/0     0            0           56m
    ibm-commonui-operator                                 0/0     0            0           53m
    ibm-elasticsearch-operator-ibm-es-controller-manager  0/0     0            0           54m
    ibm-events-operator-v5.0.1                            0/0     0            0           54m
    ibm-iam-operator                                      0/0     0            0           54m
    ibm-ir-ai-operator-controller-manager                 0/0     0            0           47m
    ibm-redis-cp-operator                                 0/0     0            0           49m
    ibm-secure-tunnel-operator                            0/0     0            0           48m
    ibm-watson-aiops-ui-operator-controller-manager       0/0     0            0           48m
    ibm-zen-operator                                      0/0     0            0           54m
    ir-core-operator-controller-manager                   0/0     0            0           47m
    ir-lifecycle-operator-controller-manager              0/0     0            0           47m
    operand-deployment-lifecycle-manager                  0/0     0            0           55m
    postgresql-operator-controller-manager-1-18-12        0/0     0            0           54m
    

    Note: If you upgraded from an earlier version of IBM Cloud Pak for AIOps, you also have an icp-mongodb-operator deployment.
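
    If you want to block until the operator pods have stopped, a small polling loop can help. This is a sketch, and assumes that a scaled-down deployment reports an empty or zero .status.replicas:

    # Poll until every operator deployment reports zero replicas.
    while oc get deployment -n ${AIOPS_NAMESPACE} -l olm.owner.kind=ClusterServiceVersion \
            -o jsonpath='{range .items[*]}{.status.replicas}{"\n"}{end}' | grep -qv '^0*$'; do
        echo "Waiting for operator pods to stop..."
        sleep 5
    done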

  2. Scale down the StatefulSets that you noted in step 3.2.

    You can use the Red Hat OpenShift web console, or create a shell script to do this.

    If you have a base deployment, then remove the following lines from the example shell script:

    oc scale deployment aiops-lad-flink --replicas=0 -n ${AIOPS_NAMESPACE}
    oc scale deployment aiops-lad-flink-taskmanager --replicas=0 -n ${AIOPS_NAMESPACE}
    

    If you upgraded from an earlier version of IBM Cloud Pak for AIOps, then add the following line to the example shell script:

    oc scale statefulsets icp-mongodb --replicas=0 -n ${AIOPS_NAMESPACE}
    

    Example shell script:

    #!/bin/bash
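    # Scale down each StatefulSet and Flink deployment, pausing briefly between commands.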
    
    oc scale statefulsets aimanager-ibm-minio --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets ${AIOPS_INSTANCE}-redis-server --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets aiops-ir-analytics-spark-worker --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets aiops-ir-core-ncobackup --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets aiops-ir-core-ncoprimary --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale deployment aiops-ir-lifecycle-flink --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale deployment aiops-ir-lifecycle-flink-taskmanager --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets aiops-topology-cassandra --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets c-example-couchdbcluster-m --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale deployment aiops-lad-flink --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale deployment aiops-lad-flink-taskmanager --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets -l app.kubernetes.io/managed-by-ibm-elasticsearch --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets zen-minio --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    

    Run the following command to check that the number of replicas for each of the StatefulSets is now 0.

    oc get statefulsets -n ${AIOPS_NAMESPACE}
    

    Example output:

    NAME                                               READY   AGE
    aimanager-ibm-minio                                0/0     112m
    aiops-ir-analytics-spark-worker                    0/0     128m
    aiops-ir-core-ncobackup                            0/0     131m
    aiops-ir-core-ncoprimary                           0/0     133m
    aiops-topology-cassandra                           0/0     138m
    c-example-couchdbcluster-m                         0/0     134m
    aiops-ibm-elasticsearch-es-server-all              0/0     143m
    ibm-cp-aiops-redis-server                          0/0     140m
    zen-minio                                          0/0     134m
    

  3. Run the following command to check that the number of replicas for each of the Flink deployments is now 0.

    oc get deployments -n ${AIOPS_NAMESPACE} | grep flink | grep -v "operator"
    

    Example output:

    NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
    aiops-ir-lifecycle-flink                          0/0     0            0           137m
    aiops-ir-lifecycle-flink-taskmanager              0/0     0            0           137m
    aiops-lad-flink                                   0/0     0            0           139m
    aiops-lad-flink-taskmanager                       0/0     0            0           139m
    

  4. Shut down the Kafka and ZooKeeper pods.

    oc delete pod -l ibmevents.ibm.com/name=iaf-system-kafka -n ${AIOPS_NAMESPACE}
    oc delete pod -l ibmevents.ibm.com/name=iaf-system-zookeeper -n ${AIOPS_NAMESPACE}
    

    Run the following command to check that the Kafka and ZooKeeper pods have shut down successfully. If the shutdown is complete, no pods are returned.

    oc get pod -l ibmevents.ibm.com/controller=strimzipodset -n ${AIOPS_NAMESPACE}
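
    Alternatively, oc wait can block until the pods are gone. A sketch, assuming a five-minute timeout is acceptable; the command can also return an error immediately if no matching pods exist, which likewise means that the shutdown is complete:

    oc wait --for=delete pod -l ibmevents.ibm.com/controller=strimzipodset -n ${AIOPS_NAMESPACE} --timeout=300s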
    
  5. Scale down the PostgreSQL pods.

    When shutting down a PostgreSQL cluster, it is best to remove the primary replica last. The following script deletes each replica pod in every PostgreSQL cluster, leaving the primary pod until the end.

    Before running the script, replace <project> with the namespace (project) that your IBM Cloud Pak for AIOps installation is deployed in.

    #!/bin/bash
    
    AIOPS_NAMESPACE=<project>
    
    # Get array of Postgres clusters
    CLUSTERS=($(oc get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" -o go-template='{{range .items}}{{.metadata.name}}{{" "}}{{end}}'))
    
    # For each Postgres cluster, shutdown primary last
    for cluster_name in "${CLUSTERS[@]}"; do
        primary=$(oc get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" "${cluster_name}" -o go-template='{{.status.currentPrimary}}')
        instances=($(oc get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" "${cluster_name}" -o go-template='{{range .status.instanceNames}}{{print . " "}}{{end}}'))
        for instance_name in "${instances[@]}"; do
            # Shutdown non-primary replicas
            if [ "${instance_name}" != "${primary}" ]; then
                oc delete pod -n "${AIOPS_NAMESPACE}" "${instance_name}" --ignore-not-found
            fi
        done
    
        # Shutdown the primary once all other replicas are down
        oc delete pod -n "${AIOPS_NAMESPACE}" "${primary}" --ignore-not-found
    done
    

    Wait for all the PostgreSQL pods to be deleted. All pods are deleted when the following command returns no pods:

    oc get pod -l k8s.enterprisedb.io/podRole=instance -n ${AIOPS_NAMESPACE}
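
    As with the Kafka and ZooKeeper pods, oc wait can block until the pods are gone. A sketch, assuming a five-minute timeout:

    oc wait --for=delete pod -l k8s.enterprisedb.io/podRole=instance -n ${AIOPS_NAMESPACE} --timeout=300s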
    
  6. (Optional) After the StatefulSets and StrimziPodSets are scaled down, drain the worker nodes.

    Run the following command for each of the worker nodes:

    oc adm drain <node>
    

    Where <node> is the name of the node to drain.

    Note: Some pods, such as storage pods, do not stop because stopping them would violate the disruption budget. If this problem occurs, run the drain command on each node until only the storage pods are left, then stop the command and drain the next node.

5. Shut down the cluster

  1. Shut down all the worker nodes on the cluster.

  2. Shut down all the master nodes on the cluster.

  3. Shut down the API node on the cluster.

For more information about shutting down your cluster nodes, see step 4 in the Red Hat OpenShift documentation Shutting down a cluster gracefully.

6. Restart the cluster

  1. Re-export the environment variables that you saved in step 1.1.

  2. Restart the cluster nodes in the following order:

    1. Restart the API node.

    2. Restart the master nodes. Check whether all master nodes are in ready status by running the following command:

      oc get nodes
      
    3. Restart the worker nodes. Check whether all worker nodes are in ready status by running the following command:

      oc get nodes
      
  3. After all the nodes are up, uncordon the worker nodes.

    Run the following command for each of the worker nodes:

    oc adm uncordon <node>
    

    Where <node> is the name of the node to uncordon.
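
    As with cordoning, you can uncordon all the worker nodes in one pass. A sketch, again assuming the standard node-role.kubernetes.io/worker label:

    for node in $(oc get nodes -l node-role.kubernetes.io/worker= -o name); do
        oc adm uncordon "$node"
    done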

7. Scale up the workloads

Scaling up the workloads in the following order helps to minimize startup time and resource contention issues.

  1. Scale the events operator back up.

    oc scale deployment --replicas=1 $(oc get deployment -o custom-columns=NAME:.metadata.name --no-headers -n ${AIOPS_NAMESPACE}  | grep '^ibm-events-operator-') -n ${AIOPS_NAMESPACE}
    
  2. Check whether the Kafka and ZooKeeper pods are running again. This can take a few minutes.

    oc get pod -l ibmevents.ibm.com/controller=strimzipodset -n ${AIOPS_NAMESPACE}
    

    Example output when the Kafka and ZooKeeper pods are running:

    NAME                                    READY   STATUS    RESTARTS   AGE
    iaf-system-kafka-0                      1/1     Running   0          13d
    iaf-system-kafka-1                      1/1     Running   0          13d
    iaf-system-kafka-2                      1/1     Running   0          13d
    iaf-system-zookeeper-0                  1/1     Running   0          13d
    iaf-system-zookeeper-1                  1/1     Running   0          13d
    iaf-system-zookeeper-2                  1/1     Running   0          13d
    
  3. Scale up each of the StatefulSets to the number of replicas that you noted in step 3.2.

    1. Scale up Cassandra, Elasticsearch, and Spark StatefulSets in the following order:

      • aiops-topology-cassandra
      • aiops-ibm-elasticsearch-es-server-all
      • aiops-ir-analytics-spark-worker

      Run the following command to scale up each StatefulSet:

      oc scale statefulsets <statefulset> --replicas=<number_of_replicas> -n ${AIOPS_NAMESPACE}
      

      Where:

      • <statefulset> is the StatefulSet to be scaled up
      • <number_of_replicas> is the number of replicas that the StatefulSet is to be scaled up to

      For example:

      oc scale statefulsets aiops-topology-cassandra --replicas=1 -n cp4aiops
      

      Note: If IBM Cloud Pak for AIOps is deployed on a multi-zone architecture, then there are multiple Elasticsearch StatefulSets with numbered zone names. Each Elasticsearch StatefulSet must be scaled up during this step.
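
      To list every Elasticsearch StatefulSet, you can reuse the label selector from the scale-down script in step 4.2:

      oc get statefulsets -n ${AIOPS_NAMESPACE} -l app.kubernetes.io/managed-by-ibm-elasticsearch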

    2. Scale the Flink deployments in the following order:

      • aiops-ir-lifecycle-flink
      • aiops-ir-lifecycle-flink-taskmanager
      • aiops-lad-flink
      • aiops-lad-flink-taskmanager

      Note: If you have a base deployment, then do not scale up aiops-lad-flink and aiops-lad-flink-taskmanager.

      Run the following command to scale up the Flink deployments:

      oc scale deployment <flink_deployment> --replicas=<number_of_replicas> -n ${AIOPS_NAMESPACE}
      

      Where:

      • <flink_deployment> is the Flink deployment to be scaled up
      • <number_of_replicas> is the number of replicas that the deployment is to be scaled up to

    3. Scale the following StatefulSets in the specified order:

      • aiops-ir-core-ncoprimary
      • aiops-ir-core-ncobackup
      • c-example-couchdbcluster-m
      • ${AIOPS_INSTANCE}-redis-server
      • aimanager-ibm-minio
      • zen-minio

      Run the following command to scale up each StatefulSet:

      oc scale statefulsets <statefulset> --replicas=<number_of_replicas> -n ${AIOPS_NAMESPACE}
      

      Where:

      • <statefulset> is the StatefulSet to be scaled up
      • <number_of_replicas> is the number of replicas that the StatefulSet is to be scaled up to
  4. Scale up the operator deployments.

    oc scale deployment -l olm.owner.kind=ClusterServiceVersion -n ${AIOPS_NAMESPACE} --replicas=1
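
    To confirm that the operators are running again, rerun the check from step 4.1 and verify that each deployment reports 1/1:

    oc get deployment -n ${AIOPS_NAMESPACE} -l olm.owner.kind=ClusterServiceVersion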
    

8. Validate the installation

Note: After a complete cluster restart, it might take approximately an hour for the installation to start running again.

Run the describe command:

oc describe installations.orchestrator.aiops.ibm.com -n "${AIOPS_NAMESPACE}"

Review the ComponentStatus fields to confirm that all components are marked as Ready and the phase is Running.
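
Because the restart can take that long, a polling loop saves rerunning the describe command by hand. A minimal sketch, using the same assumed .status.phase field as the scripted check in step 1:

until [ "$(oc get installation "${AIOPS_INSTANCE}" -n "${AIOPS_NAMESPACE}" -o jsonpath='{.status.phase}')" = "Running" ]; do
    echo "Installation phase is not yet Running; checking again in 60 seconds..."
    sleep 60
done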