Recovering from node failure (IBM Cloud Pak for AIOps on Linux)

Learn about recovering from a failed worker or control plane node on a Linux® cluster that has a deployment of IBM Cloud Pak® for AIOps running on it.

Overview

IBM Cloud Pak for AIOps is highly available and able to withstand a single node failure. IBM Cloud Pak for AIOps replicates stateful and nonstateful pods and storage, and can also move nonstateful pods to alternative nodes if the node that they are scheduled on fails.

If a node in your Linux cluster fails and is inaccessible, use the procedure in this topic to delete it. Stateful pods that were on the failed node are rescheduled to alternative nodes, and data is replicated from other healthy replicas.

Procedure

  1. Run the following command to identify a failed node.

    oc get node
    

    A failed worker node has a STATUS of NotReady and does not have control-plane,etcd,master in the ROLES column.

    A failed control plane node has a STATUS of NotReady and has control-plane,etcd,master in the ROLES column.

    Example output for a Linux cluster with a failed worker node called agent1.acme.com:

    NAME              STATUS    ROLES                      AGE   VERSION
    agent1.acme.com   NotReady  <none>                     7d2h  v1.30.2+k3s2
    agent2.acme.com   Ready     <none>                     7d2h  v1.30.2+k3s2
    agent3.acme.com   Ready     <none>                     7d2h  v1.30.2+k3s2
    agent4.acme.com   Ready     <none>                     7d2h  v1.30.2+k3s2
    agent5.acme.com   Ready     <none>                     7d2h  v1.30.2+k3s2
    agent6.acme.com   Ready     <none>                     7d2h  v1.30.2+k3s2
    agent7.acme.com   Ready     <none>                     7d2h  v1.30.2+k3s2
    server1.acme.com  Ready     control-plane,etcd,master  7d2h  v1.30.2+k3s2
    server2.acme.com  Ready     control-plane,etcd,master  7d2h  v1.30.2+k3s2
    server3.acme.com  Ready     control-plane,etcd,master  7d2h  v1.30.2+k3s2
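
On a cluster with many nodes, the failed nodes can be picked out of the listing with a small filter. The following is a minimal sketch, not part of the product: the function name list_not_ready_nodes is hypothetical, and it assumes the column order shown in the example output (NAME, STATUS, ROLES).

```shell
# Hypothetical helper: reads `oc get node` output on stdin and prints the
# name and roles of every node whose STATUS starts with NotReady.
list_not_ready_nodes() {
  # NR > 1 skips the header row; $1 = NAME, $2 = STATUS, $3 = ROLES.
  awk 'NR > 1 && $2 ~ /^NotReady/ { print $1, $3 }'
}

# Usage (on a machine with access to the cluster):
#   oc get node | list_not_ready_nodes
```

For the example output in this step, the filter prints agent1.acme.com <none>. A roles value of <none> indicates a worker node.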
    

  2. Add another node to your Linux cluster so that replicas can be distributed across the additional node.
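
Before you continue, you might want to confirm that the node that you added is Ready, so that it can receive the relocated replicas. The following is a sketch under assumptions, not a documented step: the function name wait_for_node_ready and the default timeout of 300s are hypothetical choices, and it assumes that the oc wait subcommand is available in the same way as oc get node.

```shell
# Hypothetical helper: blocks until the named node reports the Ready
# condition, or fails when the timeout (default 300s, an arbitrary
# choice) expires.
wait_for_node_ready() {
  node_name="$1"
  timeout="${2:-300s}"
  oc wait --for=condition=Ready "node/${node_name}" --timeout="${timeout}"
}

# Usage (agent8.acme.com is a placeholder for the node that you added):
#   wait_for_node_ready agent8.acme.com
```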

  3. Delete the failed node.

    Run the following command from a control plane node to delete the failed node and relocate its workloads and storage to other nodes in the cluster.

    aiopsctl cluster delete-node <node_name>
    

    Where <node_name> is the name of the failed worker or control plane node.

    You are warned that data loss might occur and asked whether to continue. Before you answer, examine the Data stored on node table to confirm that your data stores are replicated sufficiently to survive the deletion of the node. Check that the last column, Replicas remaining after deletion, is not 0 for any row, and that no additional warning message to contact IBM Support is displayed. If both conditions are met and you want to continue, type y at the prompt.

    Example output for a cluster where it is safe to continue and remove the node:

    $ aiopsctl cluster delete-node agent1.acme.com
    [WARN] Removing node agent1.acme.com. This may cause data loss, read the information below carefully before continuing.
    
    Data stored on node:
    
    Name                          Total replicas  Replicas on node  Replicas remaining after deletion
    Kafka                         3               0                 3
    Identity Management Postgres  2               0                 2
    Zen MinIO                     3               1                 2
    CouchDB                       3               0                 3
    Elasticsearch                 3               0                 3
    Zen Postgres                  2               0                 2
    AIOps Postgres                3               1                 2
    AIOps MinIO                   5               0                 5
    Redis                         3               1                 2
    Cassandra                     3               0                 3
    
    Are you sure you wish to continue? (y/n): y
    
    o- [24 Jul 24 08:44 CDT] Deleting node agent1.acme.com...
    o- [24 Jul 24 08:44 CDT] Recovering datastores...
    
    # Node is removed
    

    Note: If you have only one control plane node, then you cannot delete that node, and aiopsctl prevents you from doing so.
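
The check of the Replicas remaining after deletion column can be sketched as a small filter over the text that aiopsctl prints. This is an illustrative sketch only: the function name check_replicas_remaining is hypothetical, and it assumes the table layout shown in the example output, with the remaining replica count as the last field of each row. It does not detect the additional warning to contact IBM Support, so that still needs to be checked by eye.

```shell
# Hypothetical helper: reads the text printed by `aiopsctl cluster delete-node`
# on stdin, prints every data store row whose last column is 0, and returns
# a non-zero status if at least one such row exists.
check_replicas_remaining() {
  awk '
    # The header row marks the start of the table.
    /Name[[:space:]]+Total replicas/ { in_table = 1; next }
    # A blank line marks the end of the table.
    in_table && NF == 0 { in_table = 0 }
    # The last field of each row is "Replicas remaining after deletion".
    in_table && $NF == 0 { print; unsafe = 1 }
    END { exit (unsafe ? 1 : 0) }
  '
}

# Usage (table.txt is a placeholder for a file that holds the copied output):
#   check_replicas_remaining < table.txt || echo "Do not continue"
```

For the example output in this step, the filter prints nothing and returns 0, because every data store keeps at least two replicas.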

  4. Update the environment variables file aiops_var.sh.

    Edit the shell script aiops_var.sh that you created when you installed IBM Cloud Pak for AIOps on your Linux cluster, so that the environment variables file remains accurate.

    For more information about aiops_var.sh, see Create environment variables in Online installation of IBM Cloud Pak for AIOps on Linux or Offline installation of IBM Cloud Pak for AIOps on Linux.

    If you deleted a worker node:

    Remove the entry for the deleted worker node from aiops_var.sh.

    If you deleted a control plane node:

    • If the deleted control plane node is in the ADDITIONAL_CONTROL_PLANES array, then remove the entry for the deleted control plane node.
    • If the deleted control plane node is the CONTROL_PLANE_NODE, then replace the value of CONTROL_PLANE_NODE with the value for the new main control plane node that you added in step 2, if you have not already done so.
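
After you edit the file, a quick search can confirm that no reference to the deleted node remains. The following is a minimal sketch: the function name check_node_removed is hypothetical, and it makes no assumption about the layout of aiops_var.sh beyond the node name appearing as literal text.

```shell
# Hypothetical helper: prints the lines of the variables file that still
# mention the deleted node. Returns 1 if any are found, 2 if the file
# cannot be read, and 0 if the node name no longer appears.
check_node_removed() {
  vars_file="$1"
  node_name="$2"
  if [ ! -r "${vars_file}" ]; then
    echo "Cannot read ${vars_file}" >&2
    return 2
  fi
  # -F matches the name literally; -n prefixes each match with its line number.
  if grep -n -F -- "${node_name}" "${vars_file}"; then
    return 1
  fi
  return 0
}

# Usage:
#   check_node_removed aiops_var.sh agent1.acme.com
```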
  5. If you added and deleted a control plane node, then update your load balancer's configuration.