Restarting the environment (IBM Cloud Pak for AIOps on OpenShift)
Learn how to shut down and restart the Red Hat OpenShift cluster where IBM Cloud Pak for AIOps is deployed.
Overview
Use this procedure before a known maintenance window or outage to shut down the Red Hat OpenShift cluster where IBM Cloud Pak for AIOps is installed, and to restart the cluster and workloads afterward.
Warning: If you need to shut down the cluster where IBM Cloud Pak for AIOps is installed, then you must use the following procedure. Failure to do so can result in data loss or corruption.
Procedure
1. Validate the installation
- Set environment variables.
  Make a note of these environment variables or save them to a file, because you need to export them again after you restart your cluster.
      export AIOPS_NAMESPACE=<project>
      export AIOPS_INSTANCE=$(oc get installation.orchestrator.aiops.ibm.com -o jsonpath='{.items[0].metadata.name}' -n ${AIOPS_NAMESPACE})
  Where <project> is the namespace (project) that your IBM Cloud Pak for AIOps installation is deployed in.
- Run the describe command:
      oc describe installations.orchestrator.aiops.ibm.com -n "${AIOPS_NAMESPACE}"
  Review the ComponentStatus fields to confirm that all components are marked as Ready and the phase is Running.
  Example output:
      Name:         ibm-cp-aiops
      Namespace:    aiops
      API Version:  orchestrator.aiops.ibm.com/v1alpha1
      Kind:         Installation
      Spec:
        ...
      Status:
        Componentstatus:
          Aimanager:                   Ready
          Aiopsanalyticsorchestrator:  Ready
          Aiopsedge:                   Ready
          Aiopsui:                     Ready
          Asm:                         Ready
          Baseui:                      Ready
          Cluster:                     Ready
          Commonservice:               Ready
          Elasticsearchcluster:        Ready
          Flinkdeployment:             Ready
          Issueresolutioncore:         Ready
          Kafka:                       Ready
          Lifecycleservice:            Ready
          Lifecycletrigger:            Ready
          Rediscp:                     Ready
          Tunnel:                      Ready
          Zenservice:                  Ready
        Phase:                         Running
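Because the environment variables must be exported again after the restart, it can help to write them to a file before you shut anything down. The following is a minimal sketch, not part of the documented procedure: the function name save_aiops_env and the default file name aiops-env.sh are hypothetical.

```shell
#!/bin/bash
# Sketch: persist the two variables so that they can be re-exported later.
# save_aiops_env and the default file name are illustrative names only.
save_aiops_env() {
  local outfile=${1:-aiops-env.sh}
  # %q shell-quotes each value so that the file can be sourced safely.
  printf 'export AIOPS_NAMESPACE=%q\nexport AIOPS_INSTANCE=%q\n' \
    "${AIOPS_NAMESPACE}" "${AIOPS_INSTANCE}" > "${outfile}"
}

# Usage, after both variables are set:
#   save_aiops_env ./aiops-env.sh
# After the cluster restart:
#   source ./aiops-env.sh
```

Store the file somewhere that survives the outage, for example on the workstation that you run oc from rather than on a cluster node.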
2. Check the certificates
Ensure that none of the certificates have problems or are expired.
Run the following command:
while read l; do echo "$l" | grep '^NAME' || (n=$(echo $l | sed 's/ .*//'); s=$(echo $l | sed 's/^[^ ]* *\([^ ]*\).*/\1/'); x=$(oc get secret -n $n $s -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate 2>/dev/null | sed 's!notAfter=!!'); echo "$l" | sed 's![^ ][^ ]*$!'"$x"'!'); done< <(oc get secret -A --field-selector=type==kubernetes.io/tls -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,EXPIRY:.metadata.name)
Example output excerpt:
ibm-licensing ibm-license-service-cert Jan 8 13:32:07 2025 GMT
ibm-licensing ibm-license-service-cert-internal Jan 7 13:31:12 2026 GMT
ibm-licensing ibm-licensing-service-prometheus-cert Jan 7 13:31:25 2026 GMT
cp4aiops aimanager-aio-log-anomaly-feedback-learning-cert Jan 7 14:01:43 2026 GMT
cp4aiops aimanager-aio-log-anomaly-golden-signals-cert Jan 7 14:01:43 2026 GMT
cp4aiops aimanager-aio-oob-recommended-actions-cert Jan 7 14:01:43 2026 GMT
<...>
Renew or re-create any certificates that have problems, are expired, or will expire before the cluster is restarted. For more information about certificate management, see Renew or re-create certificates in OpenShift 4.x.
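To decide whether a certificate will expire before the cluster is restarted, the expiry dates printed by the preceding command can be compared with the planned restart date. The following sketch assumes GNU date (for the -d option); the function name expires_before is hypothetical.

```shell
#!/bin/bash
# Sketch: succeed (exit 0) if a certificate expiry date falls before a cutoff date.
# Assumes GNU date, which can parse the "notAfter" format printed by openssl.
# expires_before is an illustrative name, not a product command.
expires_before() {
  local expiry_epoch cutoff_epoch
  expiry_epoch=$(date -u -d "$1" +%s) || return 2
  cutoff_epoch=$(date -u -d "$2" +%s) || return 2
  [ "${expiry_epoch}" -lt "${cutoff_epoch}" ]
}

# Usage: flag a certificate that expires before a planned restart on 1 June 2025.
#   if expires_before "Jan 8 13:32:07 2025 GMT" "2025-06-01"; then
#     echo "renew this certificate before shutting down"
#   fi
```

An exit code of 2 means that one of the dates could not be parsed, which keeps a malformed date from being mistaken for a valid certificate.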
3. Prepare to scale down
- Cordon the worker nodes.
  Run the following command for each of the worker nodes:
      oc adm cordon <node>
  Where <node> is the name of the node to cordon.
- Make a note of the number of replicas.
  - Make a note of the number of replicas for each StatefulSet.
        oc get statefulsets -n ${AIOPS_NAMESPACE}
    Example output:
        NAME                                    READY   AGE
        aimanager-ibm-minio                     1/1     18m
        aiops-ir-analytics-spark-worker         2/2     33m
        aiops-ir-core-ncobackup                 1/1     37m
        aiops-ir-core-ncoprimary                1/1     39m
        aiops-topology-cassandra                1/1     43m
        c-example-couchdbcluster-m              1/1     40m
        aiops-ibm-elasticsearch-es-server-all   1/1     49m
        ibm-cp-aiops-redis-server               3/3     45m
        zen-minio                               3/3     40m
    Note:
    - If you do not have an IBM® Netcool® Operations Insight® probe integration, then aiops-ir-core-ncobackup and aiops-ir-core-ncoprimary have zero replicas.
    - If you upgraded from an earlier version of IBM Cloud Pak for AIOps, you also have an icp-mongodb StatefulSet.
  - Make a note of the number of replicas for each StrimziPodSet.
        oc get strimzipodset -n ${AIOPS_NAMESPACE}
    Example output:
        NAME                   PODS   READY PODS   CURRENT PODS   AGE
        iaf-system-kafka       3      3            3              13d
        iaf-system-zookeeper   3      3            3              13d
  - Make a note of the number of replicas for each Flink deployment by using the following command:
        oc get deployment -n ${AIOPS_NAMESPACE} | grep flink | grep -v "operator"
    Example output:
        NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
        aiops-ir-lifecycle-flink               1/1     1            1           137m
        aiops-ir-lifecycle-flink-taskmanager   1/1     1            1           137m
        aiops-lad-flink                        1/1     1            1           139m
        aiops-lad-flink-taskmanager            2/2     2            2           139m
    Notes:
    - If you have a base deployment, then aiops-lad-flink and aiops-lad-flink-taskmanager do not show in the preceding output.
    - If you upgraded from an earlier version of IBM Cloud Pak for AIOps, you also have an icp-mongodb StatefulSet.
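Instead of copying the replica counts by hand, they can be written to a file and looked up again when you scale the workloads back up. The following is a sketch under stated assumptions: the function names record_replicas and saved_replicas and the file names are hypothetical, and it covers only StatefulSets and Deployments, whose desired count is read from .spec.replicas. StrimziPodSets are not covered by this sketch.

```shell
#!/bin/bash
# Sketch: record and look up replica counts. Function names are illustrative.

# record_replicas KIND FILE
# Writes one "<name> <replicas>" line for every KIND in the AIOps namespace.
record_replicas() {
  local kind=$1 outfile=$2
  oc get "${kind}" -n "${AIOPS_NAMESPACE}" \
    -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas \
    --no-headers > "${outfile}"
}

# saved_replicas FILE NAME
# Prints the recorded count for one workload, or nothing if it was not recorded.
saved_replicas() {
  awk -v name="$2" '$1 == name { print $2 }' "$1"
}

# Usage:
#   record_replicas statefulsets ./statefulset-replicas.txt
#   record_replicas deployments  ./deployment-replicas.txt
#   saved_replicas ./statefulset-replicas.txt zen-minio
```

Run record_replicas before anything is scaled down, because afterward the recorded value would be 0.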
4. Scale down the workloads and drain the nodes
- Scale down the operator deployments in the IBM Cloud Pak for AIOps namespace.
  - Run the following commands to create two scripts named aiops-operator-scale-down.sh and aiops-operator-scale-up.sh. The commands read the operator deployments and generate one script that scales each deployment to 0 replicas and another that restores its current replica count.
        oc get deploy -n "${AIOPS_NAMESPACE}" -l olm.owner.kind=ClusterServiceVersion -o go-template='{{range .items}}{{printf "oc scale deploy -n '"${AIOPS_NAMESPACE}"' %s --replicas=0\n" .metadata.name }}{{end}}' > aiops-operator-scale-down.sh
        oc get deploy -n "${AIOPS_NAMESPACE}" -l olm.owner.kind=ClusterServiceVersion -o go-template='{{range .items}}{{printf "oc scale deploy -n '"${AIOPS_NAMESPACE}"' %s --replicas=%d\n" .metadata.name .spec.replicas }}{{end}}' > aiops-operator-scale-up.sh
  - Scale down the operators.
    Run the following commands:
        chmod +x ./aiops-operator-scale-down.sh
        ./aiops-operator-scale-down.sh
  - Run the following command to check that the number of replicas for each of the operator deployments is now 0.
        oc get deployment -n ${AIOPS_NAMESPACE} -l olm.owner.kind=ClusterServiceVersion
    Example output:
        NAME                                                   READY   UP-TO-DATE   AVAILABLE   AGE
        aimanager-operator-controller-manager                  0/0     0            0           47m
        aiopsedge-operator-controller-manager                  0/0     0            0           47m
        asm-operator                                           0/0     0            0           47m
        flink-kubernetes-operator                              0/0     0            0           54m
        ibm-aiops-orchestrator-controller-manager              0/0     0            0           58m
        ibm-common-service-operator                            0/0     0            0           56m
        ibm-commonui-operator                                  0/0     0            0           53m
        ibm-elasticsearch-operator-ibm-es-controller-manager   0/0     0            0           54m
        ibm-events-operator-v5.0.1                             0/0     0            0           54m
        ibm-iam-operator                                       0/0     0            0           54m
        ibm-ir-ai-operator-controller-manager                  0/0     0            0           47m
        ibm-redis-cp-operator                                  0/0     0            0           49m
        ibm-secure-tunnel-operator                             0/0     0            0           48m
        ibm-watson-aiops-ui-operator-controller-manager        0/0     0            0           48m
        ibm-zen-operator                                       0/0     0            0           54m
        ir-core-operator-controller-manager                    0/0     0            0           47m
        ir-lifecycle-operator-controller-manager               0/0     0            0           47m
        operand-deployment-lifecycle-manager                   0/0     0            0           55m
        postgresql-operator-controller-manager-1-18-12         0/0     0            0           54m
    Note: If you upgraded from an earlier version of IBM Cloud Pak for AIOps, you also have an icp-mongodb-operator deployment.
- Scale down the StatefulSets that you noted in step 3.2.
  You can use the Cloud Pak for AIOps console, or create a shell script to do this.
  If you have a base deployment, then remove the following lines from the example shell script:
      oc scale deployment aiops-lad-flink --replicas=0 -n ${AIOPS_NAMESPACE}
      oc scale deployment aiops-lad-flink-taskmanager --replicas=0 -n ${AIOPS_NAMESPACE}
  If you upgraded from an earlier version of IBM Cloud Pak for AIOps, then add the following line to the example shell script:
      oc scale statefulsets icp-mongodb --replicas=0 -n ${AIOPS_NAMESPACE}
  Example shell script:
      #!/bin/bash
      oc scale statefulsets aimanager-ibm-minio --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
      oc scale statefulsets ${AIOPS_INSTANCE}-redis-server --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
      oc scale statefulsets aiops-ir-analytics-spark-worker --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
      oc scale statefulsets aiops-ir-core-ncobackup --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
      oc scale statefulsets aiops-ir-core-ncoprimary --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
      oc scale deployment aiops-ir-lifecycle-flink --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
      oc scale deployment aiops-ir-lifecycle-flink-taskmanager --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
      oc scale statefulsets aiops-topology-cassandra --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
      oc scale statefulsets c-example-couchdbcluster-m --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
      oc scale deployment aiops-lad-flink --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
      oc scale deployment aiops-lad-flink-taskmanager --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
      oc scale statefulsets -l app.kubernetes.io/managed-by=ibm-elasticsearch --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
      oc scale statefulsets zen-minio --replicas=0 -n ${AIOPS_NAMESPACE}
      sleep 2
  Run the following command to check that the number of replicas for each of the StatefulSets is now 0.
      oc get statefulsets -n ${AIOPS_NAMESPACE}
  Example output:
      NAME                                    READY   AGE
      aimanager-ibm-minio                     0/0     112m
      aiops-ir-analytics-spark-worker         0/0     128m
      aiops-ir-core-ncobackup                 0/0     131m
      aiops-ir-core-ncoprimary                0/0     133m
      aiops-topology-cassandra                0/0     138m
      c-example-couchdbcluster-m              0/0     134m
      aiops-ibm-elasticsearch-es-server-all   0/0     143m
      ibm-cp-aiops-redis-server               0/0     140m
      zen-minio                               0/0     134m
- Run the following command to check that the number of replicas for each of the Flink deployments is now 0.
      oc get deployments -n ${AIOPS_NAMESPACE} | grep flink | grep -v "operator"
  Example output:
      NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
      aiops-ir-lifecycle-flink               0/0     0            0           137m
      aiops-ir-lifecycle-flink-taskmanager   0/0     0            0           137m
      aiops-lad-flink                        0/0     0            0           139m
      aiops-lad-flink-taskmanager            0/0     0            0           139m
- Shut down the Kafka and ZooKeeper pods.
      oc delete pod -l ibmevents.ibm.com/name=iaf-system-kafka -n ${AIOPS_NAMESPACE}
      oc delete pod -l ibmevents.ibm.com/name=iaf-system-zookeeper -n ${AIOPS_NAMESPACE}
  Run the following command to check that the Kafka and ZooKeeper pods have shut down. If the shutdown is complete, no pods are returned.
      oc get pod -l ibmevents.ibm.com/controller=strimzipodset -n ${AIOPS_NAMESPACE}
- Scale down the PostgreSQL pods.
  When shutting down a PostgreSQL cluster, it is best to remove the primary replica last. The following script removes each database replica in the cluster, with the primary removed last.
  Before running the script, replace <project> with the namespace (project) that your IBM Cloud Pak for AIOps installation is deployed in.
      #!/bin/bash
      AIOPS_NAMESPACE=<project>

      # Get array of Postgres clusters
      CLUSTERS=($(oc get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" -o go-template='{{range .items}}{{.metadata.name}}{{" "}}{{end}}'))

      # For each Postgres cluster, shut down the primary last
      for cluster_name in "${CLUSTERS[@]}"; do
        primary=$(oc get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" "${cluster_name}" -o go-template='{{.status.currentPrimary}}')
        instances=($(oc get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" "${cluster_name}" -o go-template='{{range .status.instanceNames}}{{print . " "}}{{end}}'))
        for instance_name in "${instances[@]}"; do
          # Shut down non-primary replicas
          if [ "${instance_name}" != "${primary}" ]; then
            oc delete pod -n "${AIOPS_NAMESPACE}" "${instance_name}" --ignore-not-found
          fi
        done
        # Shut down the primary once all other replicas are down
        oc delete pod -n "${AIOPS_NAMESPACE}" "${primary}" --ignore-not-found
      done
  Wait for all the Postgres pods to be deleted. All pods are deleted when the following command returns no pods:
      oc get pod -l k8s.enterprisedb.io/podRole=instance -n ${AIOPS_NAMESPACE}
- (Optional) After the StatefulSets and StrimziPodSets are scaled down, drain the worker nodes.
  Run the following command for each of the worker nodes:
      oc adm drain <node> --ignore-daemonsets --delete-emptydir-data --disable-eviction
  Where <node> is the name of the node to drain.
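The Kafka, ZooKeeper, and PostgreSQL checks in this step all wait until a label selector returns no pods. That wait can be scripted rather than repeated by hand. The following is a minimal polling sketch; the function name wait_for_no_pods and its default timeout and interval are hypothetical choices, not documented values.

```shell
#!/bin/bash
# Sketch: poll until no pods match a label selector in the AIOps namespace.
# wait_for_no_pods, the 600 second timeout, and the 10 second interval are
# illustrative assumptions.
wait_for_no_pods() {
  local selector=$1 timeout=${2:-600} interval=${3:-10} waited=0
  # "-o name" prints one line per matching pod and nothing when none remain.
  while [ -n "$(oc get pod -l "${selector}" -n "${AIOPS_NAMESPACE}" -o name)" ]; do
    if [ "${waited}" -ge "${timeout}" ]; then
      echo "Timed out waiting for pods with selector ${selector}" >&2
      return 1
    fi
    sleep "${interval}"
    waited=$((waited + interval))
  done
  return 0
}

# Usage:
#   wait_for_no_pods ibmevents.ibm.com/controller=strimzipodset
#   wait_for_no_pods k8s.enterprisedb.io/podRole=instance
```

A nonzero exit code means that pods were still present when the timeout expired, so a calling script can stop before it drains or shuts down nodes.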
5. Shut down the cluster
- Shut down all the worker nodes on the cluster.
- Shut down all the master nodes on the cluster.
- Shut down the API node on the cluster.
For more information about shutting down your cluster nodes, see step 4 in the Red Hat OpenShift documentation Shutting down a cluster gracefully.
6. Restart the cluster
- Re-export the environment variables that you saved in step 1.1.
- Restart the cluster nodes in the following order:
  - Restart the API node.
  - Restart the master nodes. Check whether all master nodes are in Ready status by running the following command:
        oc get nodes
  - Restart the worker nodes. Check whether all worker nodes are in Ready status by running the following command:
        oc get nodes
- After all the nodes are up, uncordon the worker nodes.
  Run the following command for each of the worker nodes:
      oc adm uncordon <node>
  Where <node> is the name of the node to uncordon.
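The readiness check and the per-node uncordon command can be combined in a short script. The sketch below assumes that worker nodes carry the standard node-role.kubernetes.io/worker label; the function names not_ready_nodes and uncordon_workers are hypothetical. A cordoned node reports a status such as Ready,SchedulingDisabled, so only the first part of the STATUS column is compared.

```shell
#!/bin/bash
# Sketch: uncordon all workers, but only when every node reports Ready.
# Function names are illustrative; the worker label is an assumption.

# Reads "oc get nodes --no-headers" output on stdin and prints the name of
# every node whose status does not start with "Ready".
not_ready_nodes() {
  awk '{ split($2, status, ","); if (status[1] != "Ready") print $1 }'
}

uncordon_workers() {
  local pending node
  pending=$(oc get nodes --no-headers | not_ready_nodes)
  if [ -n "${pending}" ]; then
    echo "Nodes not Ready: ${pending}" >&2
    return 1
  fi
  for node in $(oc get nodes -l node-role.kubernetes.io/worker \
      -o custom-columns=NAME:.metadata.name --no-headers); do
    oc adm uncordon "${node}"
  done
}

# Usage:
#   uncordon_workers || echo "wait for all nodes to become Ready, then retry"
```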
7. Scale up the workloads
Scaling up the workloads in the following order helps to minimize startup time and resource contention issues.
- Scale the events operator back up.
      oc scale deployment --replicas=1 $(oc get deployment -o custom-columns=NAME:.metadata.name --no-headers -n ${AIOPS_NAMESPACE} | grep '^ibm-events-operator-') -n ${AIOPS_NAMESPACE}
- Check whether the Kafka and ZooKeeper pods are running again. This can take a few minutes.
      oc get pod -l ibmevents.ibm.com/controller=strimzipodset -n ${AIOPS_NAMESPACE}
  Example output when the Kafka and ZooKeeper pods are running:
      NAME                     READY   STATUS    RESTARTS   AGE
      iaf-system-kafka-0       1/1     Running   0          13d
      iaf-system-kafka-1       1/1     Running   0          13d
      iaf-system-kafka-2       1/1     Running   0          13d
      iaf-system-zookeeper-0   1/1     Running   0          13d
      iaf-system-zookeeper-1   1/1     Running   0          13d
      iaf-system-zookeeper-2   1/1     Running   0          13d
- Scale up each of the StatefulSets to the number of replicas that you noted in step 3.2.
  Note: You only need to wait for pods to be scheduled (status of ContainerCreating) before you start the next service listed in the procedure.
  - Scale up the following StatefulSets in the specified order:
    - aiops-topology-cassandra
    - aiops-ibm-elasticsearch-es-server-all
    - aiops-ir-analytics-spark-worker
    - aimanager-ibm-minio
    Run the following command to scale up each StatefulSet:
        oc scale statefulsets <statefulset> --replicas=<number_of_replicas> -n ${AIOPS_NAMESPACE}
    Where:
    - <statefulset> is the StatefulSet to be scaled up
    - <number_of_replicas> is the number of replicas that the StatefulSet is to be scaled up to
    For example:
        oc scale statefulsets aiops-topology-cassandra --replicas=1 -n cp4aiops
    Note: If IBM Cloud Pak for AIOps is deployed on a multi-zone architecture, then there are multiple Elasticsearch StatefulSets with numbered zone names. Each Elasticsearch StatefulSet must be scaled up during this step.
  - Scale the Flink deployments in the following order:
    - aiops-ir-lifecycle-flink
    - aiops-ir-lifecycle-flink-taskmanager
    - aiops-lad-flink
    - aiops-lad-flink-taskmanager
    Note: If you have a base deployment, then do not scale up aiops-lad-flink and aiops-lad-flink-taskmanager.
    Run the following command to scale up the Flink deployments:
        oc scale deployment <flink_deployment> --replicas=<number_of_replicas> -n ${AIOPS_NAMESPACE}
    Where <flink_deployment> is the name of the Flink deployment, and <number_of_replicas> is the number of replicas that you noted in step 3.2.
  - Scale the following StatefulSets in the specified order:
    - aiops-ir-core-ncoprimary
    - aiops-ir-core-ncobackup
    - c-example-couchdbcluster-m
    - ${AIOPS_INSTANCE}-redis-server
    - zen-minio
    Note: Only scale up aiops-ir-core-ncoprimary and aiops-ir-core-ncobackup if you are connecting with on-premises probes.
    Run the following command to scale up each StatefulSet:
        oc scale statefulsets <statefulset> --replicas=<number_of_replicas> -n ${AIOPS_NAMESPACE}
    Where:
    - <statefulset> is the StatefulSet to be scaled up
    - <number_of_replicas> is the number of replicas that the StatefulSet is to be scaled up to
- Scale up the operator deployments.
  Run the aiops-operator-scale-up.sh script that you created earlier in step 4, Scale down the workloads and drain the nodes:
      chmod +x ./aiops-operator-scale-up.sh
      ./aiops-operator-scale-up.sh
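The ordered scale-up of StatefulSets and Flink deployments in this step can be expressed as one loop over name and replica pairs. The following is a sketch only: the function name scale_up_in_order and the SCALE_PAUSE variable are hypothetical, and the fixed pause stands in for checking that the pods were scheduled, which this sketch does not verify.

```shell
#!/bin/bash
# Sketch: scale workloads of one kind in the order given on the command line.
# scale_up_in_order and SCALE_PAUSE are illustrative names.
#
# scale_up_in_order KIND NAME=REPLICAS [NAME=REPLICAS ...]
scale_up_in_order() {
  local kind=$1 pair name replicas
  shift
  for pair in "$@"; do
    name=${pair%%=*}       # text before the first "="
    replicas=${pair#*=}    # text after the first "="
    oc scale "${kind}" "${name}" --replicas="${replicas}" -n "${AIOPS_NAMESPACE}" || return 1
    # Short pause before starting the next workload.
    sleep "${SCALE_PAUSE:-2}"
  done
}

# Usage, with counts taken from the example output in step 3
# (substitute the counts that you recorded for your own cluster):
#   scale_up_in_order statefulsets \
#     aiops-topology-cassandra=1 \
#     aiops-ibm-elasticsearch-es-server-all=1 \
#     aiops-ir-analytics-spark-worker=2 \
#     aimanager-ibm-minio=1
```

Passing the pairs as arguments keeps the order explicit in the calling command, so the sequence in the procedure can be copied directly.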
8. Validate the installation
Note: After a complete cluster restart, it might take approximately an hour for the installation to start running again.
Run the describe command:
oc describe installations.orchestrator.aiops.ibm.com -n "${AIOPS_NAMESPACE}"
Review the ComponentStatus fields to confirm that all components are marked as Ready and the phase is Running.
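Reviewing the describe output by eye can be replaced with a small filter that reports any component that is not Ready. The following sketch assumes that the output has the same layout as the example in step 1, with component lines indented under a Componentstatus heading and a separate Phase line; the function name installation_ready is hypothetical.

```shell
#!/bin/bash
# Sketch: read "oc describe installations..." output on stdin and exit 0 only
# if every component is Ready and the phase is Running.
# Assumes the layout of the example output; installation_ready is illustrative.
installation_ready() {
  awk '
    { match($0, /^ */); indent = RLENGTH }            # leading spaces on this line
    /^ *Componentstatus:/ { in_block = 1; base = indent; next }
    in_block && indent <= base { in_block = 0 }       # block ends when indentation drops
    in_block && NF == 2 && $2 != "Ready" { print "Not ready: " $1; bad = 1 }
    $1 == "Phase:" { phase = $2 }
    END {
      if (phase != "Running") { print "Phase: " phase; bad = 1 }
      exit (bad ? 1 : 0)
    }
  '
}

# Usage:
#   oc describe installations.orchestrator.aiops.ibm.com -n "${AIOPS_NAMESPACE}" | installation_ready
# To poll once a minute until the installation is running again:
#   until oc describe installations.orchestrator.aiops.ibm.com -n "${AIOPS_NAMESPACE}" | installation_ready; do
#     sleep 60
#   done
```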