Restart of all Cassandra pods causes errors for connecting services
If all of the Cassandra pods restart, some services might have problems reconnecting.
Problem
When all of the Cassandra pods go down simultaneously, the cloud native analytics user interface displays the following error after the pods come back up:

   An error occurred while fetching data from the server. The response from the server was '500'. Please try again later.

The kubectl get events command also outputs a warning:

   Warning FailedToUpdateEndpoint Endpoints Failed to update endpoint
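To narrow the event output to warnings only, you can filter on the event type; the namespace shown here is a placeholder for wherever your pods run:

   kubectl get events --field-selector type=Warning -n <namespace>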
Resolution
Use the following procedure to resolve this problem. In the following commands, <release_name> is the name of your deployment, as specified by the value used for name (Operator Lifecycle Manager UI Form view), or name in the metadata section of the noi.ibm.com_noihybrids_cr.yaml or noi.ibm.com_nois_cr.yaml files (YAML view).

1. Check the state of the Cassandra nodes. From the Cassandra container, use the Cassandra CLI nodetool, as in the following example:

   kubectl exec -ti release_name-cassandra-0 -- bash
   [cassandra@release_name-cassandra-0 /]$ nodetool status
   Datacenter: datacenter1
   =======================
   Status=Up/Down
   |/ State=Normal/Leaving/Joining/Moving
   --  Address      Load        Tokens  Owns (effective)  Host ID                               Rack
   UN  10.1.106.37  636.99 KiB  256     100.0%            d439ea16-7b55-4920-a9a3-22e878feb844  rack1
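   If you only need the status output, a minimal non-interactive variant (assuming the same pod name) is:

   kubectl exec release_name-cassandra-0 -- nodetool status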
   Note: If none of the nodes are in DN status, skip the scaling down steps and proceed to step 8 to restart the pods.

2. Scale Cassandra down to 0 instances with this command:

   kubectl scale --replicas=0 statefulset/release_name-cassandra
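   To watch the Cassandra pods terminate as the StatefulSet scales down, you can add the watch flag to the pod listing:

   kubectl get pods -w | grep cass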
3. Verify that there are no Cassandra pods running:

   kubectl get pods | grep cass

4. Scale Cassandra back up to one instance:

   kubectl scale --replicas=1 statefulset/release_name-cassandra
5. Verify that there is one Cassandra pod running:

   kubectl get pods | grep cass

6. Repeat step 4, incrementing the replica count each time, until the required number of Cassandra pods are running. Wait for each Cassandra pod to come up before incrementing the replica count to start another. A sketch of this sequence is shown below.
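   A minimal sketch of the scale-up sequence, assuming that three Cassandra replicas are required (adjust the replica counts and timeout to your deployment):

   kubectl scale --replicas=2 statefulset/release_name-cassandra
   kubectl wait --for=condition=Ready pod/release_name-cassandra-1 --timeout=600s
   kubectl scale --replicas=3 statefulset/release_name-cassandra
   kubectl wait --for=condition=Ready pod/release_name-cassandra-2 --timeout=600s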
7. Verify that the cluster is running with this command:

   kubectl exec -ti release_name-cassandra-0 -- bash
   [cassandra@release_name-cassandra-0 /]$ nodetool status

   Expect to see UN for all nodes in the cluster, as in this example:

   Datacenter: datacenter1
   Status=Up/Down
   |/ State=Normal/Leaving/Joining/Moving
   --  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
   UN  10.1.26.101   598.28 KiB  256     100.0%            bbd34cab-9e91-45c1-bfcb-1fe59855d9b3  rack1
   UN  10.1.150.13   654.78 KiB  256     100.0%            555f00c8-c43d-4962-a8a0-72eed028d306  rack1
   UN  10.1.228.111  560.28 KiB  256     100.0%            8741a69b-acdb-4736-bc74-905d18ebdafa  rack1
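   For a quick non-interactive check, counting the nodes reported as UN (assuming the same pod name) should match the number of Cassandra replicas:

   kubectl exec release_name-cassandra-0 -- nodetool status | grep -c '^UN'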
8. Restart the pods that connect to Cassandra by deleting them so that they are re-created. The pod names carry a generated suffix, so first list them with kubectl get pods, then delete each one, as in the following commands:

   kubectl delete pod release_name-ibm-hdm-analytics-dev-policyregistryservice-<pod_suffix>
   kubectl delete pod release_name-ibm-hdm-analytics-dev-eventsqueryservice-<pod_suffix>
   kubectl delete pod release_name-ibm-hdm-analytics-dev-archivingservice-<pod_suffix>
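   Alternatively, assuming these components run as deployments with the names shown above, you can trigger a restart without looking up individual pod names:

   kubectl rollout restart deployment/release_name-ibm-hdm-analytics-dev-policyregistryservice
   kubectl rollout restart deployment/release_name-ibm-hdm-analytics-dev-eventsqueryservice
   kubectl rollout restart deployment/release_name-ibm-hdm-analytics-dev-archivingservice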
9. Log back in to the UI.