Restart of all Cassandra pods causes errors for connecting services

If the Cassandra pods restart, some services may have problems reconnecting.

Problem

When all the Cassandra pods go down simultaneously, the cloud native analytics user interface displays the following error when the pods come back up:
An error occurred while fetching data from the server. The response from the server was '500'. Please try again later.
kubectl get events also outputs a warning:
Warning FailedToUpdateEndpoint Endpoints Failed to update endpoint

Resolution

Use the following procedure to resolve this problem.
  1. Check the state of the Cassandra nodes. From the Cassandra container, use the Cassandra CLI nodetool, as in the following example:
    kubectl exec -ti release_name-cassandra-0 bash
    [cassandra@m76-cassandra-0 /]$ nodetool status
    Datacenter: datacenter1
    =======================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address      Load       Tokens       Owns (effective)  Host ID                               Rack
    UN  10.1.106.37  636.99 KiB  256          100.0%            d439ea16-7b55-4920-a9a3-22e878feb844  rack1
    Where <release_name> is the name of your deployment, as specified by the value used for name (Operator Lifecycle Manager UI Form view), or name in the metadata section of the noi.ibm.com_noihybrids_cr.yaml or noi.ibm.com_nois_cr.yaml files (YAML view).
    Note: If none of the nodes are in DN status, skip the scale-down steps and proceed to step 8 to restart the pods.
  2. Scale Cassandra down to 0 instances with this command:
    kubectl scale --replicas=0 StatefulSet/release_name-cassandra
    Where <release_name> is the name of your deployment, as specified by the value used for name (Operator Lifecycle Manager UI Form view), or name in the metadata section of the noi.ibm.com_noihybrids_cr.yaml or noi.ibm.com_nois_cr.yaml files (YAML view).
  3. Use kubectl get pods | grep cass to verify that there are no Cassandra pods running.
  4. Scale Cassandra back up to one instance.
    kubectl scale --replicas=1 StatefulSet/release_name-cassandra
    Where <release_name> is the name of your deployment, as specified by the value used for name (Operator Lifecycle Manager UI Form view), or name in the metadata section of the noi.ibm.com_noihybrids_cr.yaml or noi.ibm.com_nois_cr.yaml files (YAML view).
  5. Use kubectl get pods | grep cass to verify that there is one Cassandra pod running.
  6. Repeat steps 4 and 5, incrementing the replica count each time, until the required number of Cassandra pods are running. Wait for each Cassandra pod to come up before incrementing the replica count to start another.
  7. Verify that the cluster is running with this command:
    kubectl exec -ti release_name-cassandra-0 bash
    [cassandra@m86-cassandra-0 /]$ nodetool status
    Where <release_name> is the name of your deployment, as specified by the value used for name (Operator Lifecycle Manager UI Form view), or name in the metadata section of the noi.ibm.com_noihybrids_cr.yaml or noi.ibm.com_nois_cr.yaml files (YAML view).
    Expect to see UN for all nodes in cluster, as in this example:
    Datacenter: datacenter1
    =======================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address       Load        Tokens       Owns (effective)  Host ID                               Rack
    UN  10.1.26.101   598.28 KiB  256          100.0%            bbd34cab-9e91-45c1-bfcb-1fe59855d9b3  rack1
    UN  10.1.150.13   654.78 KiB  256          100.0%            555f00c8-c43d-4962-a8a0-72eed028d306  rack1
    UN  10.1.228.111  560.28 KiB  256          100.0%            8741a69b-acdb-4736-bc74-905d18ebdafa  rack1
  8. Restart the pods that connect to Cassandra with the following commands:
    kubectl delete pod release_name-ibm-hdm-analytics-dev-policyregistryservice
    kubectl delete pod release_name-ibm-hdm-analytics-dev-eventsqueryservice
    kubectl delete pod release_name-ibm-hdm-analytics-dev-archivingservice
    Where <release_name> is the name of your deployment, as specified by the value used for name (Operator Lifecycle Manager UI Form view), or name in the metadata section of the noi.ibm.com_noihybrids_cr.yaml or noi.ibm.com_nois_cr.yaml files (YAML view).
  9. Log in to the UI again.
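The procedure above can be sketched as a Bash script. This is illustrative only: the RELEASE, REPLICAS, and KUBECTL variables are placeholders introduced here (not part of the product), and the KUBECTL override exists so the script can be dry-run with echo before touching a live cluster.

```shell
#!/usr/bin/env bash
# Sketch of the Cassandra recovery procedure above. Adapt before use.
set -euo pipefail

RELEASE="${RELEASE:-release_name}"   # your deployment name (placeholder)
REPLICAS="${REPLICAS:-3}"            # required number of Cassandra pods
KUBECTL="${KUBECTL:-kubectl}"        # set to 'echo' for a dry run

# Return success if a `nodetool status` listing contains any DN node.
has_down_nodes() {
  grep -qE '^DN' <<<"$1"
}

recover_cassandra() {
  # Steps 2-6: scale down to 0, then back up one replica at a time.
  "$KUBECTL" scale --replicas=0 "StatefulSet/${RELEASE}-cassandra"
  for n in $(seq 1 "$REPLICAS"); do
    "$KUBECTL" scale --replicas="$n" "StatefulSet/${RELEASE}-cassandra"
    # Wait for each pod to come up before starting the next one.
    "$KUBECTL" rollout status "StatefulSet/${RELEASE}-cassandra"
  done
  # Step 8: restart the pods that connect to Cassandra.
  for svc in policyregistryservice eventsqueryservice archivingservice; do
    "$KUBECTL" delete pod "${RELEASE}-ibm-hdm-analytics-dev-${svc}"
  done
}
```

Run `nodetool status` inside the Cassandra container first and feed its output to has_down_nodes; only call recover_cassandra if DN nodes are present, otherwise go straight to restarting the connecting pods.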