Recovering from node failure (IBM Cloud Pak for AIOps on Linux)
Learn about recovering from a failed worker or control plane node on a Linux® cluster that has a deployment of IBM Cloud Pak® for AIOps running on it.
Overview
IBM Cloud Pak for AIOps is highly available and able to withstand a single node failure. IBM Cloud Pak for AIOps replicates stateful and nonstateful pods and storage, and can also move nonstateful pods to alternative nodes if the node that they are scheduled on fails.
If a node in your Linux cluster fails and is inaccessible, use the procedure in this topic to delete it. Stateful pods that were on the failed node are rescheduled to alternative nodes, and data is replicated from other healthy replicas.
Note: The following procedures are also available:
- If you no longer want a node in your Linux cluster to be used by your IBM Cloud Pak for AIOps deployment, follow the instructions in Removing a node.
- If you want to uninstall IBM Cloud Pak for AIOps completely, follow the instructions in Uninstalling IBM Cloud Pak for AIOps on Linux.
Procedure
-
Run the following command to identify a failed node.
oc get node
A failed worker node has a STATUS of NotReady and does not have a ROLE of control-plane.etcd.master.
A failed control plane node has a STATUS of NotReady and has a ROLE of control-plane.etcd.master.
Example output for a Linux cluster with a failed worker node called agent1.acme.com:
NAME               STATUS     ROLES                       AGE    VERSION
agent1.acme.com    NotReady   <none>                      7d2h   v1.30.2+k3s2
agent2.acme.com    Ready      <none>                      7d2h   v1.30.2+k3s2
agent3.acme.com    Ready      <none>                      7d2h   v1.30.2+k3s2
agent4.acme.com    Ready      <none>                      7d2h   v1.30.2+k3s2
agent5.acme.com    Ready      <none>                      7d2h   v1.30.2+k3s2
agent6.acme.com    Ready      <none>                      7d2h   v1.30.2+k3s2
agent7.acme.com    Ready      <none>                      7d2h   v1.30.2+k3s2
server1.acme.com   Ready      control-plane.etcd.master   7d2h   v1.30.2+k3s2
server2.acme.com   Ready      control-plane.etcd.master   7d2h   v1.30.2+k3s2
server3.acme.com   Ready      control-plane.etcd.master   7d2h   v1.30.2+k3s2
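On a cluster with many nodes, it can help to filter the node listing instead of reading it by eye. The following is a minimal sketch, not part of the product tooling: the function name list_failed_nodes is hypothetical, and it assumes that the listing has the same column order as the example output (NAME, STATUS, ROLES) and that control plane roles contain the string control-plane.

```shell
# list_failed_nodes (hypothetical helper): reads node rows on stdin, without
# the header line, and prints "<node name> <worker|control-plane>" for every
# node whose STATUS column starts with NotReady.
# Assumes the column order NAME STATUS ROLES ... shown in the example output.
list_failed_nodes() {
  awk '$2 ~ /^NotReady/ {
    kind = ($3 ~ /control-plane/) ? "control-plane" : "worker"
    print $1, kind
  }'
}
```

A typical invocation would be `oc get node --no-headers | list_failed_nodes`, where --no-headers suppresses the column header row. The second field of each printed line indicates whether to follow the worker or the control plane instructions in the next step.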
-
Add another node to your Linux cluster so that replicas can be distributed across the additional node.
- If you have a failed worker node, follow the instructions in Adding a worker node.
- If you have a failed control plane node, follow the instructions in Adding a control plane node.
-
Delete the failed node.
Run the following command from a control plane node to delete the failed node and relocate its workloads and storage to other nodes in the cluster.
aiopsctl cluster delete-node <node_name>
Where <node_name> is the name of the failed worker or control plane node.
You are prompted with a warning that data loss might occur, and a query whether to continue. To avoid data loss, examine the Data stored on node table to check that the replication of your data stores is sufficient such that data loss does not occur if you continue to delete the node. Check that the last column, Replicas remaining after deletion, does not have 0 for any row, and that an extra warning message to contact IBM Support is not displayed. Type y at the prompt if you want to continue.
Example output for a cluster where it is safe to continue and remove the node:
$ aiopsctl cluster delete-node agent1.acme.com
[WARN] Removing node agent1.acme.com. This may cause data loss, read the information below carefully before continuing.
Data stored on node:
Name                           Total replicas   Replicas on node   Replicas remaining after deletion
Kafka                          3                0                  3
Identity Management Postgres   2                0                  2
Zen MinIO                      3                1                  2
CouchDB                        3                0                  3
Elasticsearch                  3                0                  3
Zen Postgres                   2                0                  2
AIOps Postgres                 3                1                  2
AIOps MinIO                    5                0                  5
Redis                          3                1                  2
Cassandra                      3                0                  3
Are you sure you wish to continue? (y/n): y
o- [24 Jul 24 08:44 CDT] Deleting node agent1.acme.com...
o- [24 Jul 24 08:44 CDT] Recovering datastores...
# Node is removed
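The check on the last column can also be expressed as a small filter over a copy of the table text. The sketch below is illustrative only: check_replicas is a hypothetical helper, not an aiopsctl feature, and it assumes that each data row ends in three whole-number columns as in the example output. It does not detect the extra warning message to contact IBM Support, so the full prompt text still needs to be read.

```shell
# check_replicas (hypothetical helper): reads the "Data stored on node" table
# text on stdin and prints the name of every data store whose last column
# (Replicas remaining after deletion) is 0.
# Exit status is 1 if any such row is found, otherwise 0.
# Assumes every data row ends in three integer columns, as in the example.
check_replicas() {
  awk '
    NF >= 4 && $(NF-2) ~ /^[0-9]+$/ && $(NF-1) ~ /^[0-9]+$/ && $NF ~ /^[0-9]+$/ {
      if ($NF == 0) {
        # The data store name is everything before the three numeric columns.
        name = $1
        for (i = 2; i <= NF - 3; i++) name = name " " $i
        print name
        unsafe = 1
      }
    }
    END { exit (unsafe ? 1 : 0) }
  '
}
```

For example, after pasting the table from the prompt into a file, `check_replicas < table.txt` prints nothing and returns 0 when every row keeps at least one replica. The file name table.txt is an assumption for illustration.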
Note: If you have only one control plane node, then you cannot delete that node, and aiopsctl prevents you from doing so.
-
Update the environment variables file aiops_var.sh.
Edit the shell script aiops_var.sh that you created when you installed IBM Cloud Pak for AIOps on your Linux cluster. This is to maintain the accuracy of the environment variables file.
For more information about aiops_var.sh, see Create environment variables in Online installation of IBM Cloud Pak for AIOps on Linux or Offline installation of IBM Cloud Pak for AIOps on Linux.
If you deleted a worker node:
- Remove the entry for the deleted worker node from aiops_var.sh.
If you deleted a control plane node:
- If the deleted control plane node is in the ADDITIONAL_CONTROL_PLANES array, then remove the entry for the deleted control plane node.
- If the deleted control plane node is the CONTROL_PLANE_NODE, then replace the value of CONTROL_PLANE_NODE with the value of the new main control plane node that you created in step 2, if you have not already done so.
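After editing the file, a quick search can confirm that no reference to the deleted node remains. This is a minimal sketch using standard grep: the function name find_stale_entries is hypothetical, and the default path assumes that aiops_var.sh is in the current directory.

```shell
# find_stale_entries (hypothetical helper): prints, with line numbers, every
# line of the variables file that still mentions the given node name.
# Returns 0 if at least one reference remains, 1 if the file is clean.
# -F treats the node name as a literal string, so dots are not wildcards.
find_stale_entries() {
  node_name="$1"
  vars_file="${2:-aiops_var.sh}"   # assumed default location
  grep -n -F -- "$node_name" "$vars_file"
}
```

For example, `find_stale_entries agent1.acme.com` prints nothing once the entry for agent1.acme.com has been removed. Because the match is a plain substring search, a node name that is contained in a longer host name is also reported.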
-
If you added and deleted a control plane node, then update your load balancer's configuration.