Restarting the environment (IBM Cloud Pak for AIOps on Linux)

Learn how to shut down and restart the Linux cluster where IBM Cloud Pak for AIOps is deployed.

Overview

Use this procedure before a known maintenance window or outage to shut down the Linux cluster where IBM Cloud Pak for AIOps is installed, and to restart the cluster and workloads afterward.

Warning: If you need to shut down the cluster where IBM Cloud Pak for AIOps is installed, then you must use the following procedure. Failure to do so can result in data loss or corruption.

1. Validate the installation

Run the describe command:

kubectl describe installations.orchestrator.aiops.ibm.com -n aiops

Review the ComponentStatus fields to confirm that all components are marked as Ready and the phase is Running.

Example output:

Name:         aiops-installation
Namespace:    aiops
API Version:  orchestrator.aiops.ibm.com/v1alpha1
Kind:         Installation
Spec:
...
Status:
Componentstatus:
   Aimanager:                            Ready
   Aiopsanalyticsorchestrator:           Ready
   Aiopsedge:                            Ready
   Aiopsui:                              Ready
   Asm:                                  Ready
   Baseui:                               Ready
   Cassandra:                            Ready
   cluster.aiops-orchestrator-postgres:  Ready
   cluster.opensearch:                   Ready
   Commonservice:                        Ready
   Flinkdeployment:                      Ready
   Issueresolutioncore:                  Ready
   Kafka:                                Ready
   Lifecycleservice:                     Ready
   Lifecycletrigger:                     Ready
   Rediscp:                              Ready
   Zenservice:                           Ready
   Zookeeper:                            Ready
Phase:                   Running
Note: The Zookeeper component is present only on production-sized deployments.

2. Check the certificates

Ensure that none of the certificates have problems or are expired.

Run the following command:

while read l; do echo "$l" | grep '^NAME' || (n=$(echo $l | sed 's/ .*//'); s=$(echo $l | sed 's/^[^ ]* *\([^ ]*\).*/\1/'); x=$(kubectl get secret -n $n $s -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate 2>/dev/null | sed 's!notAfter=!!'); echo "$l" | sed 's![^ ][^ ]*$!'"$x"'!'); done< <(kubectl get secret -A --field-selector=type==kubernetes.io/tls -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,EXPIRY:.metadata.name)
Example output:
NAMESPACE   NAME                                      EXPIRY
cp4aiops    aimanager-certificate-secret              Feb 22 17:15:20 2026 GMT
cp4aiops    aiops-appconnect-ir-secret                Feb 22 16:48:00 2026 GMT
cp4aiops    aiops-ir-analytics-classifier-tls         Feb 22 17:13:13 2026 GMT
cp4aiops    aiops-ir-analytics-metric-api-tls         Feb 22 17:13:17 2026 GMT
cp4aiops    aiops-ir-analytics-metric-spark-tls       Feb 22 17:13:10 2026 GMT
cp4aiops    aiops-ir-analytics-postgres-client-cert   Feb 22 16:48:05 2026 GMT
cp4aiops    aiops-ir-analytics-postgres-server-cert   Feb 22 16:48:44 2026 GMT

Renew or re-create any certificates that have problems, are expired, or will expire before the cluster is restarted.
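The one-line command above can also be written as a more readable script. This is a sketch that assumes the same tools (kubectl, base64, openssl) and prints tab-separated output rather than aligned columns:

```shell
#!/bin/bash
# Sketch: print the expiry date of every kubernetes.io/tls secret in the cluster.
# Equivalent in spirit to the one-line command above; output is tab-separated.
print_tls_expiry() {
  kubectl get secret -A --field-selector=type==kubernetes.io/tls \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name --no-headers |
  while read -r ns name; do
    # Decode the certificate and extract its notAfter date
    expiry=$(kubectl get secret -n "$ns" "$name" -o jsonpath='{.data.tls\.crt}' |
      base64 -d | openssl x509 -noout -enddate 2>/dev/null | sed 's/notAfter=//')
    printf '%s\t%s\t%s\n' "$ns" "$name" "$expiry"
  done
}
print_tls_expiry
```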

3. Prepare to scale down

  1. Cordon all of the worker and control plane nodes.

    From a control plane node, run the following command for each of the worker and control plane nodes:

    kubectl cordon <node>
    

    Where <node> is the name of the node to cordon.
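Rather than cordoning each node by name, the loop can be scripted. A minimal sketch, assuming kubectl access from a control plane node:

```shell
#!/bin/bash
# Sketch: cordon every node in the cluster in one pass.
cordon_all_nodes() {
  local node
  for node in $(kubectl get nodes -o name); do
    kubectl cordon "$node"
  done
}
cordon_all_nodes
```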

  2. Make a note of the number of replicas.

    1. Make a note of the number of replicas for each StatefulSet.

      kubectl get statefulsets -n aiops
      
      Example output:
      NAME                                       READY   AGE
      aimanager-ibm-minio                        5/5     16d
      aiops-installation-redis-server            3/3     15d
      aiops-ir-analytics-cluster1-spark-worker   2/2     16d
      aiops-ir-core-ncobackup                    1/1     16d
      aiops-ir-core-ncoprimary                   1/1     16d
      aiops-topology-cassandra                   3/3     16d
      aiops-zookeeper                            3/3     15d
      c-example-couchdbcluster-m                 3/3     16d
      zen-minio                                  3/3     16d
      Note:
      • The aiops-zookeeper pods do not exist on starter size deployments.
      • If you do not have an IBM Netcool Operations Insight probe integration, then aiops-ir-core-ncobackup and aiops-ir-core-ncoprimary have zero replicas.
      • If you upgraded from an earlier version of IBM Cloud Pak for AIOps, you also have an icp-mongodb StatefulSet.
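To keep these replica counts on hand for the scale-up later, you could also save them to a file. A sketch; the filename is illustrative:

```shell
#!/bin/bash
# Sketch: record the current replica count of every StatefulSet in the aiops
# namespace, so the values are available when you scale back up.
save_statefulset_replicas() {
  kubectl get statefulsets -n aiops \
    -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas --no-headers
}
save_statefulset_replicas > statefulset-replicas.txt
```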
    2. Make a note of the number of replicas for each StrimziPodSet.

      kubectl get strimzipodset -n aiops
      

      Example output:

      NAME                                  PODS   READY PODS   CURRENT PODS   AGE
      iaf-system-controller                 3      3            3              13d
      iaf-system-kafka                      3      3            3              13d
    3. Make a note of the number of replicas for each Flink deployment by using the following command:

      kubectl get deployment -n aiops | grep flink | grep -v "operator"
      

      Example output:

      NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
      aiops-ir-lifecycle-flink               1/1     1            1           137m
      aiops-ir-lifecycle-flink-taskmanager   1/1     1            1           137m
      aiops-lad-flink                        1/1     1            1           139m
      aiops-lad-flink-taskmanager            2/2     2            2           139m
      Note: If you have a base deployment, then aiops-lad-flink and aiops-lad-flink-taskmanager do not show in the preceding output.

4. Scale down the workloads and drain the nodes

  1. Quiesce the OpenSearch cluster.

    kubectl patch clusters.opensearch.cloudpackopen.ibm.com aiops-opensearch -n aiops --type=json -p='[{"op":"replace","path":"/spec/quiesce","value":true}]'
    

    Run the following command to verify that all OpenSearch pods are removed and the OpenSearch cluster is quiesced.

    kubectl get pod -l cluster.opensearch.cloudpackopen.ibm.com=aiops-opensearch -n aiops
    

    Example output:

    No resources found in aiops namespace.
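If you prefer to wait in a script rather than re-running the check by hand, a small polling loop can be used. A sketch, assuming the same label selector as above:

```shell
#!/bin/bash
# Sketch: poll until no pods matching a label selector remain in the aiops namespace.
wait_for_no_pods() {
  local selector=$1
  while kubectl get pod -l "$selector" -n aiops -o name | grep -q .; do
    sleep 5
  done
}
wait_for_no_pods 'cluster.opensearch.cloudpackopen.ibm.com=aiops-opensearch'
```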
    
  2. Scale down the operator deployments in the IBM Cloud Pak for AIOps namespace.

    1. Run the following command to create two scripts named aiops-operator-scale-down.sh and aiops-operator-scale-up.sh. The scripts get the current replica count for all deployments, and then scale the replicas up or down.

      kubectl get deploy -n aiops -l olm.owner.kind=ClusterServiceVersion -o go-template='{{range .items}}{{printf "kubectl scale deploy -n aiops %s --replicas=0\n" .metadata.name }}{{end}}' > aiops-operator-scale-down.sh
      
      kubectl get deploy -n aiops -l olm.owner.kind=ClusterServiceVersion -o go-template='{{range .items}}{{printf "kubectl scale deploy -n aiops %s --replicas=%d\n" .metadata.name .spec.replicas }}{{end}}' > aiops-operator-scale-up.sh
      
    2. Scale down the operators.

      Run the following commands:

      chmod +x ./aiops-operator-scale-down.sh
      ./aiops-operator-scale-down.sh
      
    3. Run the following command to check that the number of replicas for each of the operator deployments is now 0.

      kubectl get deployment -n aiops -l olm.owner.kind=ClusterServiceVersion
      
      Example output:
      NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
      aimanager-operator-controller-manager             0/0     0            0           47m
      aiopsedge-operator-controller-manager             0/0     0            0           47m
      asm-operator                                      0/0     0            0           47m
      iaf-flink-operator-controller-manager             0/0     0            0           54m
      ibm-aiops-orchestrator-controller-manager         0/0     0            0           58m
      ibm-common-service-operator                       0/0     0            0           56m
      ibm-commonui-operator                             0/0     0            0           53m
      ibm-opensearch-operator-controller-manager        0/0     0            0           54m
      ibm-events-cluster-operator-v6.0.0                0/0     0            0           54m
      ibm-iam-operator                                  0/0     0            0           54m
      ibm-ir-ai-operator-controller-manager             0/0     0            0           47m
      ibm-redis-cp-operator                             0/0     0            0           49m
      ibm-secure-tunnel-operator                        0/0     0            0           48m
      ibm-watson-aiops-ui-operator-controller-manager   0/0     0            0           48m
      ibm-zen-operator                                  0/0     0            0           54m
      ir-core-operator-controller-manager               0/0     0            0           47m
      ir-lifecycle-operator-controller-manager          0/0     0            0           47m
      operand-deployment-lifecycle-manager              0/0     0            0           55m
      postgresql-operator-controller-manager-1-18-12    0/0     0            0           54m
      
      Note: If you upgraded from an earlier version of IBM Cloud Pak for AIOps, you also have an icp-mongodb-operator deployment.
  3. Scale down the StatefulSets that you noted in step 3.2.

    You can use the Cloud Pak for AIOps console, or create a shell script to do this.

    If you have a base deployment, then remove the following lines from the example shell script:

    kubectl scale deployment aiops-lad-flink --replicas=0 -n aiops
    kubectl scale deployment aiops-lad-flink-taskmanager --replicas=0 -n aiops
    

    If you upgraded from an earlier version of IBM Cloud Pak for AIOps, then add the following line to the example shell script:

    kubectl scale statefulsets icp-mongodb --replicas=0 -n aiops
    

    Example shell script:

    #!/bin/bash
    
    kubectl scale statefulsets aimanager-ibm-minio --replicas=0 -n aiops
    sleep 2
    kubectl scale statefulsets aiops-installation-redis-server --replicas=0 -n aiops
    sleep 2
    kubectl scale statefulsets aiops-ir-analytics-cluster1-spark-worker --replicas=0 -n aiops
    sleep 2
    kubectl scale statefulsets aiops-ir-core-ncobackup --replicas=0 -n aiops
    sleep 2
    kubectl scale statefulsets aiops-ir-core-ncoprimary --replicas=0 -n aiops
    sleep 2
    kubectl scale deployment aiops-ir-lifecycle-flink --replicas=0 -n aiops
    sleep 2
    kubectl scale deployment aiops-ir-lifecycle-flink-taskmanager --replicas=0 -n aiops
    sleep 2
    kubectl scale statefulsets aiops-topology-cassandra --replicas=0 -n aiops
    sleep 2
    kubectl scale statefulsets c-example-couchdbcluster-m --replicas=0 -n aiops
    sleep 2
    kubectl scale deployment aiops-lad-flink --replicas=0 -n aiops
    sleep 2
    kubectl scale deployment aiops-lad-flink-taskmanager --replicas=0 -n aiops
    sleep 2
    kubectl scale statefulsets zen-minio --replicas=0 -n aiops
    sleep 2
    kubectl scale statefulsets aiops-zookeeper --replicas=0 -n aiops
    sleep 2

    Run the following command to check that the number of replicas for each of the StatefulSets is now 0.

    kubectl get statefulsets -n aiops
    

    Example output:

    NAME                                     READY   AGE
    aimanager-ibm-minio                       0/0     42m
    aiops-installation-redis-server           0/0     84m
    aiops-ir-analytics-cluster1-spark-worker  0/0     63m
    aiops-ir-core-ncobackup                   0/0     75m
    aiops-ir-core-ncoprimary                  0/0     76m
    aiops-topology-cassandra                  0/0     83m
    aiops-zookeeper                           0/0     76m
    c-example-couchdbcluster-m                0/0     77m
    zen-minio                                 0/0     76m
  4. Run the following command to check that the number of replicas for each of the Flink deployments is now 0.

    kubectl get deployments -n aiops | grep flink | grep -v "operator"
    

    Example output:

    NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
    aiops-ir-lifecycle-flink                          0/0     0            0           137m
    aiops-ir-lifecycle-flink-taskmanager              0/0     0            0           137m
    aiops-lad-flink                                   0/0     0            0           139m
    aiops-lad-flink-taskmanager                       0/0     0            0           139m
    
  5. Shut down the Kafka and system-controller pods.

    kubectl delete pod -l ibmevents.ibm.com/name=iaf-system-kafka -n aiops
    kubectl delete pod -l ibmevents.ibm.com/name=iaf-system-controller -n aiops

    Run the following command to check that the Kafka and system-controller pods have successfully shut down. If the shutdown is complete, no pods are returned.

    kubectl get pod -l ibmevents.ibm.com/controller=strimzipodset -n aiops
  6. Scale down the PostgreSQL pods.

    When shutting down a PostgreSQL cluster, it is best to remove the primary replica last. The following script removes each database replica in the cluster with the primary removed last.

    #!/bin/bash
    
    AIOPS_NAMESPACE=aiops
    
    # Get array of Postgres clusters
    CLUSTERS=($(kubectl get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" -o go-template='{{range .items}}{{.metadata.name}}{{" "}}{{end}}'))
    
    # For each Postgres cluster, shutdown primary last
    for cluster_name in "${CLUSTERS[@]}"; do
        primary=$(kubectl get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" "${cluster_name}" -o go-template='{{.status.currentPrimary}}')
        instances=($(kubectl get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" "${cluster_name}" -o go-template='{{range .status.instanceNames}}{{print . " "}}{{end}}'))
        for instance_name in "${instances[@]}"; do
            # Shutdown non-primary replicas
            if [ "${instance_name}" != "${primary}" ]; then
                kubectl delete pod -n "${AIOPS_NAMESPACE}" "${instance_name}" --ignore-not-found
            fi
        done
    
        # Shutdown the primary once all other replicas are down
        kubectl delete pod -n "${AIOPS_NAMESPACE}" "${primary}" --ignore-not-found
    done
    

    Wait for all the Postgres pods to be deleted. All pods are deleted when the following command returns no pods:

    kubectl get pod -l k8s.enterprisedb.io/podRole=instance -n aiops
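    Instead of re-running the check manually, kubectl wait can block until the pods are gone. A sketch; the timeout value is illustrative:

```shell
#!/bin/bash
# Sketch: block until all Postgres instance pods are deleted.
wait_for_postgres_shutdown() {
  kubectl wait --for=delete pod -l k8s.enterprisedb.io/podRole=instance -n aiops --timeout=10m
}
wait_for_postgres_shutdown
```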
    
  7. (Optional) After the StatefulSets and StrimziPodSets are scaled down, drain all of the worker and control plane nodes.

    Skip this step if you are using this procedure to do a backup.

    From a control plane node, run the following command for each of the worker and control plane nodes:

    kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --disable-eviction
    

    Where <node> is the name of the node to drain.

5. Shut down the cluster

  1. Shut down all the worker nodes on the cluster.

  2. Shut down all the control plane nodes on the cluster.

6. Restart the cluster

  1. Re-export any environment variables that you saved before you shut down the cluster.

  2. Restart the cluster nodes in the following order:

    1. Restart the control plane nodes. Check whether all the control plane nodes are in ready status by running the following command:

      kubectl get nodes
      
    2. Restart the worker nodes. Check whether all worker nodes are in ready status by running the following command:

      kubectl get nodes
      
  3. After all the nodes are up, uncordon the control plane and worker nodes.

    From a control plane node, run the following command for each of the worker and control plane nodes:

    kubectl uncordon <node>
    

    Where <node> is the name of the node to uncordon.
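    The readiness check and the uncordon loop can be combined in a script. A sketch; the timeout value is illustrative:

```shell
#!/bin/bash
# Sketch: wait for every node to report Ready, then uncordon them all.
uncordon_when_ready() {
  kubectl wait --for=condition=Ready nodes --all --timeout=15m
  local node
  for node in $(kubectl get nodes -o name); do
    kubectl uncordon "$node"
  done
}
uncordon_when_ready
```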

7. Scale up the workloads

Scaling up the workloads in the following order helps to minimize startup time and resource contention issues.

  1. Scale up the Events and OpenSearch operators.

    kubectl scale deployment --replicas=1 $(kubectl get deployment -o custom-columns=NAME:.metadata.name --no-headers -n aiops  | grep '^ibm-events-cluster-operator-') -n aiops
    kubectl scale deployment --replicas=1 $(kubectl get deployment -o custom-columns=NAME:.metadata.name --no-headers -n aiops | grep '^ibm-opensearch-operator-') -n aiops
    
  2. Check whether the Kafka controllers and brokers are running again. This can take a few minutes.

    kubectl get pod -l ibmevents.ibm.com/controller=strimzipodset -n aiops

    Example output when the Kafka controllers and brokers are running:

    NAME                        READY   STATUS    RESTARTS   AGE
    iaf-system-controller-100   1/1     Running   0          15d
    iaf-system-controller-101   1/1     Running   0          15d
    iaf-system-controller-102   1/1     Running   0          15d
    iaf-system-kafka-0          1/1     Running   0          15d
    iaf-system-kafka-1          1/1     Running   0          15d
    iaf-system-kafka-2          1/1     Running   0          15d
  3. Scale up each of the StatefulSets to the number of replicas that you noted in step 3.2 (Prepare to scale down).

    Note: You only need to wait for pods to be scheduled (status of ContainerCreating) before you start the next service listed in the procedure.
    1. Scale up the following StatefulSets in the specified order:

      • aiops-topology-cassandra
      • aiops-ir-analytics-cluster1-spark-worker
      • aimanager-ibm-minio
      • aiops-zookeeper

      Run the following command to scale up each StatefulSet:

      kubectl scale statefulsets <statefulset> --replicas=<number_of_replicas> -n aiops

      Where:

      • <statefulset> is the StatefulSet to be scaled up
      • <number_of_replicas> is the number of replicas that the StatefulSet is to be scaled up to

      For example,

      kubectl scale statefulsets aiops-topology-cassandra --replicas=3 -n aiops
    2. Run the following command to make OpenSearch accept workloads:

      kubectl patch clusters.opensearch.cloudpackopen.ibm.com aiops-opensearch -n aiops --type=json -p='[{"op":"replace","path":"/spec/quiesce","value":false}]'
    3. Scale the Flink deployments in the following order:

      • aiops-ir-lifecycle-flink
      • aiops-ir-lifecycle-flink-taskmanager
      • aiops-lad-flink
      • aiops-lad-flink-taskmanager
      Note: If you have a base deployment, then do not scale up aiops-lad-flink and aiops-lad-flink-taskmanager.

      Run the following command to scale up the Flink deployments:

      kubectl scale deployment <flink_deployment> --replicas=<number_of_replicas> -n aiops
      

      Where <flink_deployment> is the name of the Flink deployment, and <number_of_replicas> is the number of replicas that you noted earlier.

    4. Scale the following StatefulSets in the specified order:

      • aiops-ir-core-ncoprimary
      • aiops-ir-core-ncobackup
      • c-example-couchdbcluster-m
      • aiops-installation-redis-server
      • zen-minio
      Note: Only scale up aiops-ir-core-ncoprimary and aiops-ir-core-ncobackup if you have an on-premises probe integration.

      Run the following command to scale up each StatefulSet:

      kubectl scale statefulsets <statefulset> --replicas=<number_of_replicas> -n aiops
      

      Where:

      • <statefulset> is the StatefulSet to be scaled up
      • <number_of_replicas> is the number of replicas that the StatefulSet is to be scaled up to
  4. Scale up the operator deployments.

    Run the aiops-operator-scale-up.sh script that you created earlier in step 4 (Scale down the workloads and drain the nodes):

    chmod +x ./aiops-operator-scale-up.sh
    ./aiops-operator-scale-up.sh
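
    Before moving on to validation, it can help to list any pods that have not come back yet. A sketch; the column positions assume the default kubectl get pods output:

```shell
#!/bin/bash
# Sketch: list pods in the aiops namespace that are not yet Running or Completed.
list_pending_pods() {
  kubectl get pods -n aiops --no-headers | awk '$3 != "Running" && $3 != "Completed"'
}
list_pending_pods
```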
    

8. Validate the installation

Note: After a complete cluster restart, it might take approximately an hour for the installation to start running again.

Run the describe command:

kubectl describe installations.orchestrator.aiops.ibm.com -n aiops

Review the ComponentStatus fields to confirm that all components are marked as Ready and the phase is Running.