Restarting the environment (IBM Cloud Pak for AIOps on Linux)
Learn how to shut down and restart the Linux cluster where IBM Cloud Pak for AIOps is deployed.
Overview
Use this procedure before a known maintenance window or outage to shut down the Linux cluster where IBM Cloud Pak for AIOps is installed, and to restart the cluster and workloads afterward.
Procedure
1. Validate the installation
Run the describe command:
kubectl describe installations.orchestrator.aiops.ibm.com -n aiops
Review the ComponentStatus fields to confirm that all components are marked as Ready and the phase is Running.
Example output:
Name: aiops-installation
Namespace: aiops
API Version: orchestrator.aiops.ibm.com/v1alpha1
Kind: Installation
Spec:
...
Status:
Componentstatus:
Aimanager: Ready
Aiopsanalyticsorchestrator: Ready
Aiopsedge: Ready
Aiopsui: Ready
Asm: Ready
Baseui: Ready
Cassandra: Ready
cluster.aiops-orchestrator-postgres: Ready
cluster.opensearch: Ready
Commonservice: Ready
Flinkdeployment: Ready
Issueresolutioncore: Ready
Kafka: Ready
Lifecycleservice: Ready
Lifecycletrigger: Ready
Rediscp: Ready
Zenservice: Ready
Zookeeper: Ready
Phase: Running
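Scanning the Componentstatus block by eye is error-prone on a large installation. The check can be sketched as a small filter that prints only components that are not Ready. The sample status is embedded as a here-document so the filter can be shown without cluster access; on a live cluster you would pipe the relevant section of the describe output into the same awk program:

```shell
#!/bin/sh
# Print any component whose status is not "Ready".
# The here-document stands in for the Componentstatus section of:
#   kubectl describe installations.orchestrator.aiops.ibm.com -n aiops
not_ready=$(awk -F': *' '/: /{ if ($2 != "Ready") print $1 }' <<'EOF'
Aimanager: Ready
Cassandra: Ready
Kafka: PendingUpgrade
Zookeeper: Ready
EOF
)
echo "Components not Ready: ${not_ready:-none}"
```

With the sample input above, the script reports `Kafka` as not Ready; on a healthy cluster it would print `none`.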
2. Check the certificates
Ensure that none of the certificates have problems or are expired.
Run the following command:
while read l; do echo "$l" | grep '^NAME' || (n=$(echo $l | sed 's/ .*//'); s=$(echo $l | sed 's/^[^ ]* *\([^ ]*\).*/\1/'); x=$(kubectl get secret -n $n $s -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate 2>/dev/null | sed 's!notAfter=!!'); echo "$l" | sed 's![^ ][^ ]*$!'"$x"'!'); done< <(kubectl get secret -A --field-selector=type==kubernetes.io/tls -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,EXPIRY:.metadata.name)
NAMESPACE NAME EXPIRY
cp4aiops aimanager-certificate-secret Feb 22 17:15:20 2026 GMT
cp4aiops aiops-appconnect-ir-secret Feb 22 16:48:00 2026 GMT
cp4aiops aiops-ir-analytics-classifier-tls Feb 22 17:13:13 2026 GMT
cp4aiops aiops-ir-analytics-metric-api-tls Feb 22 17:13:17 2026 GMT
cp4aiops aiops-ir-analytics-metric-spark-tls Feb 22 17:13:10 2026 GMT
cp4aiops aiops-ir-analytics-postgres-client-cert Feb 22 16:48:05 2026 GMT
cp4aiops aiops-ir-analytics-postgres-server-cert Feb 22 16:48:44 2026 GMT
Renew or re-create any certificates that have problems, are expired, or will expire before the cluster is restarted.
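The audit loop above decodes each secret's tls.crt and asks openssl for its notAfter date. The extraction step can be tried on any PEM certificate; this sketch generates a throwaway self-signed certificate (the /tmp file names and CN are arbitrary placeholder values) and prints its expiry in the same format as the EXPIRY column:

```shell
#!/bin/sh
# Create a throwaway self-signed certificate (30-day validity).
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" \
  -keyout /tmp/demo.key -out /tmp/demo.crt -days 30 2>/dev/null

# Print its expiry date, as the audit loop does for each TLS secret:
openssl x509 -in /tmp/demo.crt -noout -enddate | sed 's/notAfter=//'
```

On the cluster, the equivalent per-secret check is `kubectl get secret -n <namespace> <secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate`.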
3. Prepare to scale down
- Cordon all of the worker and control plane nodes.
  From a control plane node, run the following command for each of the worker and control plane nodes:
  kubectl cordon <node>
  Where <node> is the name of the node to cordon.
- Make a note of the number of replicas for each StatefulSet.
  kubectl get statefulsets -n aiops
  Example output:
  NAME                                       READY   AGE
  aimanager-ibm-minio                        5/5     16d
  aiops-ir-analytics-cluster1-spark-worker   2/2     16d
  aiops-ir-core-ncobackup                    1/1     16d
  aiops-ir-core-ncoprimary                   1/1     16d
  aiops-topology-cassandra                   3/3     16d
  aiops-zookeeper                            3/3     15d
  c-example-couchdbcluster-m                 3/3     16d
  ibm-cp-aiops-redis-server                  3/3     15d
  zen-minio                                  3/3     16d
  Note:
  - The aiops-zookeeper pods do not exist on starter size deployments.
  - If you do not have an IBM Netcool Operations Insight probe integration, then aiops-ir-core-ncobackup and aiops-ir-core-ncoprimary have zero replicas.
  - If you upgraded from an earlier version of IBM Cloud Pak for AIOps, you also have an icp-mongodb StatefulSet.
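The replica bookkeeping in this step can be made less error-prone by saving the counts to a file and generating the matching scale-up commands from it at restart time. A sketch; the file path and the two sample StatefulSet entries are placeholder values, and on a live cluster the here-document would be replaced by the jsonpath query shown in the comment:

```shell
#!/bin/sh
# Record StatefulSet replica counts (sample values; on a cluster, generate with:
#   kubectl get statefulsets -n aiops \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.replicas}{"\n"}{end}')
cat <<'EOF' > /tmp/aiops-sts-replicas.txt
aiops-topology-cassandra 3
aimanager-ibm-minio 5
EOF

# At restart time, emit one scale-up command per saved entry.
while read -r name replicas; do
  echo "kubectl scale statefulsets ${name} --replicas=${replicas} -n aiops"
done < /tmp/aiops-sts-replicas.txt
```

Removing the `echo` would execute the commands directly instead of printing them.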
- Make a note of the number of replicas for each StrimziPodSet.
  kubectl get strimzipodset -n aiops
  Example output:
  NAME                    PODS   READY PODS   CURRENT PODS   AGE
  iaf-system-controller   3      3            3              13d
  iaf-system-kafka        3      3            3              13d
- Make a note of the number of replicas for each Flink deployment by using the following command:
  kubectl get deployment | grep flink | grep -v "operator"
  Example output:
  NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
  aiops-ir-lifecycle-flink               1/1     1            1           137m
  aiops-ir-lifecycle-flink-taskmanager   1/1     1            1           137m
  aiops-lad-flink                        1/1     1            1           139m
  aiops-lad-flink-taskmanager            2/2     2            2           139m
  Note: If you have a base deployment, then aiops-lad-flink and aiops-lad-flink-taskmanager do not show in the preceding output.
4. Scale down the workloads and drain the nodes
- Quiesce the OpenSearch cluster.
  kubectl patch clusters.opensearch.cloudpackopen.ibm.com aiops-opensearch -n aiops --type=json -p='[{"op":"replace","path":"/spec/quiesce","value":true}]'
  Run the following command to verify that all OpenSearch pods are removed and the OpenSearch cluster is quiesced:
  kubectl get pod -l cluster.opensearch.cloudpackopen.ibm.com=aiops-opensearch -n aiops
  Example output:
  No resources found in aiops namespace.
- Scale down the operator deployments in the IBM Cloud Pak for AIOps namespace.
  - Run the following command to create two scripts named aiops-operator-scale-down.sh and aiops-operator-scale-up.sh. The scripts record the current replica count for all operator deployments, and then scale the replicas down or back up.
    kubectl get deploy -n aiops -l olm.owner.kind=ClusterServiceVersion -o go-template='{{range .items}}{{printf "oc scale deploy -n aiops %s --replicas=0\n" .metadata.name }}{{end}}' > aiops-operator-scale-down.sh
    kubectl get deploy -n aiops -l olm.owner.kind=ClusterServiceVersion -o go-template='{{range .items}}{{printf "oc scale deploy -n aiops %s --replicas=%d\n" .metadata.name .spec.replicas }}{{end}}' > aiops-operator-scale-up.sh
  - Scale down the operators by running the following commands:
    chmod +x ./aiops-operator-scale-down.sh
    ./aiops-operator-scale-down.sh
  - Run the following command to check that the number of replicas for each of the operator deployments is now 0.
    kubectl get deployment -n aiops -l olm.owner.kind=ClusterServiceVersion
    Example output:
    NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
    aimanager-operator-controller-manager             0/0     0            0           47m
    aiopsedge-operator-controller-manager             0/0     0            0           47m
    asm-operator                                      0/0     0            0           47m
    iaf-flink-operator-controller-manager             0/0     0            0           54m
    ibm-aiops-orchestrator-controller-manager         0/0     0            0           58m
    ibm-common-service-operator                       0/0     0            0           56m
    ibm-commonui-operator                             0/0     0            0           53m
    ibm-opensearch-operator-controller-manager        0/0     0            0           54m
    ibm-events-cluster-operator-v6.0.0                0/0     0            0           54m
    ibm-iam-operator                                  0/0     0            0           54m
    ibm-ir-ai-operator-controller-manager             0/0     0            0           47m
    ibm-redis-cp-operator                             0/0     0            0           49m
    ibm-secure-tunnel-operator                        0/0     0            0           48m
    ibm-watson-aiops-ui-operator-controller-manager   0/0     0            0           48m
    ibm-zen-operator                                  0/0     0            0           54m
    ir-core-operator-controller-manager               0/0     0            0           47m
    ir-lifecycle-operator-controller-manager          0/0     0            0           47m
    operand-deployment-lifecycle-manager              0/0     0            0           55m
    postgresql-operator-controller-manager-1-18-12    0/0     0            0           54m
    Note: If you upgraded from an earlier version of IBM Cloud Pak for AIOps, you also have an icp-mongodb-operator deployment.
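With roughly twenty operator deployments to verify, a filter that prints only the rows not yet at 0/0 is quicker than reading the whole table. A sketch using embedded sample output; on the cluster you would pipe the output of `kubectl get deployment -n aiops -l olm.owner.kind=ClusterServiceVersion` into the same awk program:

```shell
#!/bin/sh
# Print deployments whose READY column is not 0/0 (sample output embedded).
still_up=$(awk 'NR > 1 && $2 != "0/0" { print $1 }' <<'EOF'
NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
aimanager-operator-controller-manager   0/0     0            0           47m
ibm-zen-operator                        1/1     1            1           54m
EOF
)
echo "Not yet scaled down: ${still_up:-none}"
```

With the sample input, `ibm-zen-operator` is flagged; when the scale-down is complete the filter prints `none`.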
- Scale down the StatefulSets and Flink deployments that you noted in step 3.2.
  You can use the Cloud Pak for AIOps console, or create a shell script to do this.
  If you have a base deployment, then remove the following lines from the example shell script:
  kubectl scale deployment aiops-lad-flink --replicas=0 -n aiops
  kubectl scale deployment aiops-lad-flink-taskmanager --replicas=0 -n aiops
  If you upgraded from an earlier version of IBM Cloud Pak for AIOps, then add the following line to the example shell script:
  kubectl scale statefulsets icp-mongodb --replicas=0 -n aiops
  Example shell script:
  #!/bin/bash
  kubectl scale statefulsets aimanager-ibm-minio --replicas=0 -n aiops
  sleep 2
  kubectl scale statefulsets aiops-installation-redis-server --replicas=0 -n aiops
  sleep 2
  kubectl scale statefulsets aiops-ir-analytics-cluster1-spark-worker --replicas=0 -n aiops
  sleep 2
  kubectl scale statefulsets aiops-ir-core-ncobackup --replicas=0 -n aiops
  sleep 2
  kubectl scale statefulsets aiops-ir-core-ncoprimary --replicas=0 -n aiops
  sleep 2
  kubectl scale deployment aiops-ir-lifecycle-flink --replicas=0 -n aiops
  sleep 2
  kubectl scale deployment aiops-ir-lifecycle-flink-taskmanager --replicas=0 -n aiops
  sleep 2
  kubectl scale statefulsets aiops-topology-cassandra --replicas=0 -n aiops
  sleep 2
  kubectl scale statefulsets c-example-couchdbcluster-m --replicas=0 -n aiops
  sleep 2
  kubectl scale deployment aiops-lad-flink --replicas=0 -n aiops
  sleep 2
  kubectl scale deployment aiops-lad-flink-taskmanager --replicas=0 -n aiops
  sleep 2
  kubectl scale statefulsets zen-minio --replicas=0 -n aiops
  sleep 2
  kubectl scale statefulset aiops-zookeeper --replicas=0 -n aiops
  sleep 2
  Run the following command to check that the number of replicas for each of the StatefulSets is now 0.
  kubectl get statefulsets -n aiops
  Example output:
  NAME                                       READY   AGE
  aimanager-ibm-minio                        0/0     42m
  aiops-installation-redis-server            0/0     84m
  aiops-ir-analytics-cluster1-spark-worker   0/0     63m
  aiops-ir-core-ncobackup                    0/0     75m
  aiops-ir-core-ncoprimary                   0/0     76m
  aiops-topology-cassandra                   0/0     83m
  aiops-zookeeper                            0/0     76m
  c-example-couchdbcluster-m                 0/0     77m
  zen-minio                                  0/0     76m
- Run the following command to check that the number of replicas for each of the Flink deployments is now 0.
  kubectl get deployments -n aiops | grep flink | grep -v "operator"
  Example output:
  NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
  aiops-ir-lifecycle-flink               0/0     0            0           137m
  aiops-ir-lifecycle-flink-taskmanager   0/0     0            0           137m
  aiops-lad-flink                        0/0     0            0           139m
  aiops-lad-flink-taskmanager            0/0     0            0           139m
- Shut down the Kafka and system-controller pods.
  kubectl delete pod -l ibmevents.ibm.com/name=iaf-system-kafka -n aiops
  kubectl delete pod -l ibmevents.ibm.com/name=iaf-system-controller -n aiops
  Run the following command to check that the Kafka and system-controller pods have successfully shut down. If the shutdown is complete, no pods are returned.
  kubectl get pod -l ibmevents.ibm.com/controller=strimzipodset -n aiops
- Scale down the PostgreSQL pods.
  When shutting down a PostgreSQL cluster, it is best to remove the primary replica last. The following script removes each database replica in the cluster, with the primary removed last.
  #!/bin/bash
  AIOPS_NAMESPACE=aiops

  # Get array of Postgres clusters
  CLUSTERS=($(kubectl get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" -o go-template='{{range .items}}{{.metadata.name}}{{" "}}{{end}}'))

  # For each Postgres cluster, shut down the primary last
  for cluster_name in "${CLUSTERS[@]}"; do
    primary=$(kubectl get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" "${cluster_name}" -o go-template='{{.status.currentPrimary}}')
    instances=($(kubectl get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" "${cluster_name}" -o go-template='{{range .status.instanceNames}}{{print . " "}}{{end}}'))
    for instance_name in "${instances[@]}"; do
      # Shut down non-primary replicas
      if [ "${instance_name}" != "${primary}" ]; then
        kubectl delete pod -n "${AIOPS_NAMESPACE}" "${instance_name}" --ignore-not-found
      fi
    done
    # Shut down the primary once all other replicas are down
    kubectl delete pod -n "${AIOPS_NAMESPACE}" "${primary}" --ignore-not-found
  done
  Wait for all the Postgres pods to be deleted. All pods are deleted when the following command returns no pods:
  kubectl get pod -l k8s.enterprisedb.io/podRole=instance -n aiops
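Rather than re-running the get pod command by hand, the wait can be scripted as a poll loop. A sketch: `wait_until_empty` is a hypothetical helper name introduced here, and the commented-out invocation shows how it would be pointed at the Postgres pod selector on a live cluster.

```shell
#!/bin/sh
# Retry a command until it produces no output, or give up after N tries.
wait_until_empty() {
  cmd=$1; tries=${2:-12}; delay=${3:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    [ -z "$(eval "$cmd")" ] && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# On the cluster, you would wait for the Postgres instance pods:
#   wait_until_empty 'kubectl get pod -l k8s.enterprisedb.io/podRole=instance -n aiops --no-headers'
# Offline demonstration with a command that prints nothing:
wait_until_empty 'true' 3 0 && echo "all pods gone"
```

The same helper works for the earlier Kafka and OpenSearch pod checks by swapping in the matching label selector.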
- (Optional) After the StatefulSets and StrimziPodSets are scaled down, drain all of the worker and control plane nodes.
  Skip this step if you are using this procedure to do a backup.
  From a control plane node, run the following command for each of the worker and control plane nodes:
  kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --disable-eviction
  Where <node> is the name of the node to drain.
5. Shut down the cluster
- Shut down all the worker nodes on the cluster.
- Shut down all the control plane nodes on the cluster.
6. Restart the cluster
- Re-export the environment variables that you saved in step 1.1.
- Restart the cluster nodes in the following order:
  - Restart the control plane nodes. Check whether all the control plane nodes are in Ready status by running the following command:
    kubectl get nodes
  - Restart the worker nodes. Check whether all the worker nodes are in Ready status by running the following command:
    kubectl get nodes
- After all the nodes are up, uncordon the control plane and worker nodes.
  From a control plane node, run the following command for each of the worker and control plane nodes:
  kubectl uncordon <node>
  Where <node> is the name of the node to uncordon.
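Instead of uncordoning nodes one at a time, the node list can be fed through xargs. A sketch: the printf stands in for `kubectl get nodes -o name` (the node names are made-up samples), and the `echo` makes the demonstration print the commands instead of executing them; drop it on a live cluster.

```shell
#!/bin/sh
# On a live cluster:
#   kubectl get nodes -o name | xargs -r -n1 kubectl uncordon
# Offline demonstration (sample node names; echo prints instead of executing):
printf 'node/worker-1\nnode/worker-2\n' | xargs -r -n1 echo kubectl uncordon
```

The same pattern applies to the cordon and drain commands in steps 3 and 4.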
7. Scale up the workloads
Scaling up the workloads in the following order helps to minimize startup time and resource contention issues.
- Scale up the Events and OpenSearch operators.
  kubectl scale deployment --replicas=1 $(kubectl get deployment -o custom-columns=NAME:.metadata.name --no-headers -n aiops | grep '^ibm-events-cluster-operator-') -n aiops
  kubectl scale deployment --replicas=1 $(kubectl get deployment -o custom-columns=NAME:.metadata.name --no-headers -n aiops | grep '^ibm-opensearch-operator-') -n aiops
- Check whether the Kafka controllers and brokers are running again. This can take a few minutes.
  kubectl get pod -l ibmevents.ibm.com/controller=strimzipodset -n aiops
  Example output when the Kafka controllers and brokers are running:
  NAME                        READY   STATUS    RESTARTS   AGE
  iaf-system-controller-100   1/1     Running   0          15d
  iaf-system-controller-101   1/1     Running   0          15d
  iaf-system-controller-102   1/1     Running   0          15d
  iaf-system-kafka-0          1/1     Running   0          15d
  iaf-system-kafka-1          1/1     Running   0          15d
  iaf-system-kafka-2          1/1     Running   0          15d
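This readiness check can also be scripted rather than read by eye. A sketch that counts pods not yet 1/1 Running, with sample output embedded; on the cluster you would pipe the output of the get pod command above into the same awk program:

```shell
#!/bin/sh
# Count pods that are not yet 1/1 Running (sample output embedded).
pending=$(awk 'NR > 1 && ($2 != "1/1" || $3 != "Running") { n++ } END { print n + 0 }' <<'EOF'
NAME                        READY   STATUS    RESTARTS   AGE
iaf-system-controller-100   1/1     Running   0          15d
iaf-system-kafka-0          0/1     Pending   0          2m
EOF
)
echo "${pending} pod(s) not yet Running"
```

A count of 0 means the Kafka controllers and brokers are all running and you can continue to the next step.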
- Scale up each of the StatefulSets to the number of replicas that you noted in step 3.2 (Prepare to scale down).
  Note: You only need to wait for pods to be scheduled (status of ContainerCreating) before you start the next service listed in the procedure.
  - Scale up the following StatefulSets in the specified order:
    - aiops-topology-cassandra
    - aiops-ir-analytics-cluster1-spark-worker
    - aimanager-ibm-minio
    - aiops-zookeeper
    Run the following command to scale up each StatefulSet:
    kubectl scale statefulsets <statefulset> --replicas=<number_of_replicas> -n aiops
    Where:
    - <statefulset> is the StatefulSet to be scaled up
    - <number_of_replicas> is the number of replicas that the StatefulSet is to be scaled up to
    For example:
    kubectl scale statefulsets aiops-topology-cassandra --replicas=3 -n aiops
- Run the following command to make OpenSearch accept workloads again:
  kubectl patch clusters.opensearch.cloudpackopen.ibm.com aiops-opensearch -n aiops --type=json -p='[{"op":"replace","path":"/spec/quiesce","value":false}]'
- Scale up the Flink deployments in the following order:
  - aiops-ir-lifecycle-flink
  - aiops-ir-lifecycle-flink-taskmanager
  - aiops-lad-flink
  - aiops-lad-flink-taskmanager
  Note: If you have a base deployment, then do not scale up aiops-lad-flink and aiops-lad-flink-taskmanager.
  Run the following command to scale up each Flink deployment:
  kubectl scale deployment <flink_deployment> --replicas=<number_of_replicas> -n aiops
  Where <flink_deployment> is the name of the Flink deployment.
- Scale up the following StatefulSets in the specified order:
  - aiops-ir-core-ncoprimary
  - aiops-ir-core-ncobackup
  - c-example-couchdbcluster-m
  - aiops-installation-redis-server
  - zen-minio
  Note: Only scale up aiops-ir-core-ncoprimary and aiops-ir-core-ncobackup if you are connecting with on-premises probes.
  Run the following command to scale up each StatefulSet:
  kubectl scale statefulsets <statefulset> --replicas=<number_of_replicas> -n aiops
  Where:
  - <statefulset> is the StatefulSet to be scaled up
  - <number_of_replicas> is the number of replicas that the StatefulSet is to be scaled up to
- Scale up the operator deployments.
  Run the aiops-operator-scale-up.sh script that you created earlier in step 4 (Scale down the workloads and drain the nodes):
  chmod +x ./aiops-operator-scale-up.sh
  ./aiops-operator-scale-up.sh
8. Validate the installation
Run the describe command:
kubectl describe installations.orchestrator.aiops.ibm.com -n aiops
Review the ComponentStatus fields to confirm that all components are marked as Ready and the phase is Running.