Replacing a Cassandra node in your OCP cluster

The Agile Service Manager configuration for Cassandra sets up a three-node cluster with a replication factor of three, which means that your deployment remains fully functional if you lose one node. To mitigate the risk of a second node failing, however, perform the steps documented here to restore service to a three-node Cassandra cluster.

About this task

Important: These steps provide basic instructions on how to diagnose and recover from the described situation. In your specific environment, however, your OCP cluster or Cassandra might behave differently under such a failure, in which case engage your company's Cassandra administrators.

Procedure

Verify the state of your Cassandra cluster

  1. Authenticate into the Kubernetes namespace where Agile Service Manager is deployed as part of your solution.
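    For example, using the OpenShift CLI (oc), where <login token>, <API server URL>, and <namespace> are placeholders for the values in your own environment:
    oc login --token=<login token> --server=<API server URL>
    oc project <namespace>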
  2. Check the status of your Cassandra pods.
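    For example, you can list the Cassandra pods with a command like the following (this assumes the noi-cassandra pod naming shown in this topic):
    kubectl get pods | grep noi-cassandra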
    The following example system output shows a pod being terminated because a node in the cluster has failed:
    NAME               READY   STATUS        RESTARTS   AGE
    noi-cassandra-0    1/1     Running       1          4d2h
    noi-cassandra-1    1/1     Running       1          4d2h
    noi-cassandra-2    1/1     Terminating   1          4d2h 
    
  3. Verify that the Cassandra node is down.
    From one of the pods that are still running, run a command like the following example:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool status -r"
    Sample system output, where the line starting with 'D' indicates a node that is down:
    Datacenter: datacenter1
    =======================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
    UN  noi-cassandra-0.noi-cassandra.<project>.svc.cluster.local  64.34 MiB  256          100.0%            f6e6f151-ca7b-4117-be87-97245e61d7e9  rack1
    UN  noi-cassandra-1.noi-cassandra.<project>.svc.cluster.local  64.34 MiB  256          100.0%            989027b6-896b-4622-b282-9aa1dc2d9e39  rack1
    DN  10.254.4.4    64.31 MiB  256          100.0%            ce054185-d72a-4b48-9c34-e8199b6e1559  rack1
    

Restore a three-node Cassandra cluster

  1. Authenticate into the Kubernetes namespace where Agile Service Manager is deployed as part of your solution.
  2. Remove the Cassandra node that is down from the Cassandra cluster before the new one comes online.
    Use a command like the following example, specifying the Host ID of the down node as reported by nodetool status. The command runs Cassandra's nodetool utility on one of the nodes that are still up (in this case noi-cassandra-0) to remove the Cassandra node that is marked as down:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool removenode ce054185-d72a-4b48-9c34-e8199b6e1559"
  3. Confirm the removal by running a command like the following example against one of the running pods:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool status -r"
  4. Delete the pod that was running on the lost node and is reporting the Terminating state:
    kubectl delete pod noi-cassandra-2 --grace-period=0 --force
    Tip: You might not need to delete the pod, depending on your cluster configuration and the outcome of your investigation. You must, however, delete it if the pod is permanently stuck in the 'Terminating' state, as in this example.
  5. Bring the new node online in your OCP cluster.
    The container is initialized to join the Cassandra cluster, replacing the removed node.
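    To confirm that the replacement pod has started and joined the ring, you can reuse the earlier status commands, for example:
    kubectl get pods | grep noi-cassandra
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool status -r"
    All three nodes should eventually be reported as up (UN).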
    Troubleshooting: Check whether the newly added Cassandra node lists itself as a seed. Run the following command to check the configured seeds. In this example, the added node is the noi-cassandra-2 pod.
    kubectl exec noi-cassandra-2 -- bash -c "grep seeds: /opt/ibm/cassandra/conf/cassandra.yaml"
    System output example listing the seeds configured for the node:
        - seeds: "10.254.12.2,10.254.8.7"
    If the newly added node lists itself as a seed, it can report inconsistent information, with unexpected results if it is the node queried by the Agile Service Manager services. To limit the potential impact, follow the steps in Preparing your system for data restoration to stop your Agile Service Manager services from accessing your data until you have stabilized the new node.
  6. Perform a full repair.
    The following command instructs Cassandra to perform a repair of the data:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool repair --full"
    Tip: For large data sets, it is preferable to run the previous repair command several times, each time for a limited range of tokens. You can get a list of tokens with the following command:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool info --tokens"
    Example system output:
    ID                     : e2494466-0cc9-4268-a5f8-5d5fa363faaa
    …
    Token                  : -9026954462746495840
    Token                  : -8998340199710379626
    …
    Token                  : 9099714334544743528
    Token                  : 9120502118133589206
    
    Run the repair command several times, each time specifying a different range of tokens, for example:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool repair --start-token -9026954462746495840 --end-token -8998340199710379626"
    You can check the progress of the repair in the Cassandra pod logs.
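    For example, a command like the following streams the log of one of the Cassandra pods and filters it for repair-related messages:
    kubectl logs -f noi-cassandra-0 | grep -i repair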
  7. Perform a checksum verification of your data.
    For example:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool verify -e"
    If any errors are returned, repeat the repair step until the checksum verification no longer returns any errors.