Troubleshooting datastores
Review frequently encountered issues with the datastores that are used by IBM Cloud Pak® for AIOps.
Elasticsearch
Elasticsearch CA renewal rollout becomes stuck and causes the Policies page to not display policies
Elasticsearch replicas enter a crashing state when the certificate authority (CA) renewal rollout begins. This is caused by trust failures between Elasticsearch pods that hold different versions of the CA certificate.
1. Run the following commands to check your Elasticsearch cluster for this state:

   ```
   export AIOPS_PROJECT=<project>
   oc get pod -l ibm-es-server=aiops-ibm-elasticsearch-es-server -n "${AIOPS_PROJECT}"
   ```

   Where `<project>` is the project (namespace) where IBM Cloud Pak for AIOps is deployed, usually `cp4aiops`.

   Example output where the replica `aiops-ibm-elasticsearch-es-server-all-2` is crashing:

   ```
   aiops-ibm-elasticsearch-es-server-all-0   2/2   Running            0               12h   10.42.12.9    worker-1.acme.com   <none>   <none>
   aiops-ibm-elasticsearch-es-server-all-1   2/2   Running            0               12h   10.42.5.35    worker-2.acme.com   <none>   <none>
   aiops-ibm-elasticsearch-es-server-all-2   1/2   CrashLoopBackOff   59 (2m4s ago)   10h   10.42.11.36   worker-3.acme.com   <none>   <none>
   ```

   If one or more replicas have a state of `CrashLoopBackOff`, proceed to the next steps to confirm the cause.
2. Run the following command to check whether the crashing pod's logs indicate that it failed to join the Elasticsearch cluster:

   ```
   oc logs <crashing_pod> -n "${AIOPS_PROJECT}" -c elasticsearch | grep "cluster-manager not discovered"
   ```

   Where `<crashing_pod>` is the name of the pod that is crashing, for example `aiops-ibm-elasticsearch-es-server-all-2`.

   Example output where the crashing pod failed to join the Elasticsearch cluster:

   ```
   [2025-01-02T23:18:14,045][WARN ][o.o.c.c.ClusterFormationFailureHelper] [aiops-ibm-elasticsearch-es-server-all-2] cluster-manager not discovered or elected yet, an election requires at least 2 nodes with ids from [ZtUu55QzRBq7u7LIMgvTEA, 8H_HGLquRZSy3Qqw1TtgeA, ot0uVCmTRr-l8sa6M4oaDg], have discovered [{aiops-ibm-elasticsearch-es-server-all-2}{ot0uVCmTRr-l8sa6M4oaDg}{j3V12hDBSAK9LhjNEbOGgg}{127.0.0.1}{127.0.0.1:9702}{dim}{shard_indexing_pressure_enabled=true}] which is not a quorum; discovery will continue using [127.0.0.1:9700, 127.0.0.1:9701] from hosts providers and [{aiops-ibm-elasticsearch-es-server-all-2}{ot0uVCmTRr-l8sa6M4oaDg}{j3V12hDBSAK9LhjNEbOGgg}{127.0.0.1}{127.0.0.1:9702}{dim}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 16, last-accepted version 33 in term 16
   ```

   If your crashing pod's logs have output similar to the preceding example output, proceed to the next step.
3. Run the following command to check whether the Elasticsearch HAProxy logs indicate SSL handshake failures:

   ```
   oc logs <crashing_pod> -n "${AIOPS_PROJECT}" -c haproxy | grep "SSL handshake failure"
   ```

   Where `<crashing_pod>` is the name of the pod that is crashing, for example `aiops-ibm-elasticsearch-es-server-all-2`.

   Example output:

   ```
   127.0.0.1:53756 [20/Dec/2024:18:42:28.873] ft_elastic_transport_local_all_2 bk_elastic_transport_all_2/<NOSRV> -1/-1/0 0 SC 3/1/0/0/0 0/0
   10.42.4.57:51244 [20/Dec/2024:18:42:28.994] ft_elastic_http/1: SSL handshake failure
   10.42.5.35:39470 [20/Dec/2024:18:42:29.124] ft_elastic_transport/1: SSL handshake failure
   10.42.4.57:51250 [20/Dec/2024:18:42:29.201] ft_elastic_http/1: SSL handshake failure
   ```

   If your logs have output similar to the preceding example output, use the following solution to rectify this issue.
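The two log checks above can be combined into a small helper. The following is a hypothetical sketch (the `check_ca_signatures` function name is not part of the product); it scans a saved log file for the two failure signatures described in the preceding steps.

```shell
# Hypothetical helper: scan a captured log file for the two CA-renewal
# failure signatures. Capture the logs first, for example:
#   oc logs <crashing_pod> -n "${AIOPS_PROJECT}" -c elasticsearch > es.log
#   oc logs <crashing_pod> -n "${AIOPS_PROJECT}" -c haproxy > haproxy.log
check_ca_signatures() {
  # $1: path to a log file captured from either container
  if grep -q "cluster-manager not discovered" "$1"; then
    echo "elasticsearch: failed to join cluster (possible CA mismatch)"
  fi
  if grep -q "SSL handshake failure" "$1"; then
    echo "haproxy: SSL handshake failure (possible CA mismatch)"
  fi
}
```

If either message is printed for the crashing pod, the symptoms match this issue and the following solution applies.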
Solution: Run the following command to restart all three Elasticsearch pods at the same time, which helps ensure that they all mount the new CA certificate:

```
oc delete pod -l ibm-es-server=aiops-ibm-elasticsearch-es-server -n "${AIOPS_PROJECT}"
```

Example output:

```
pod "aiops-ibm-elasticsearch-es-server-all-0" deleted
pod "aiops-ibm-elasticsearch-es-server-all-1" deleted
pod "aiops-ibm-elasticsearch-es-server-all-2" deleted
```
Your Elasticsearch replicas restart and mount the new CA certificate.
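To confirm recovery, you can rerun the pod listing from step 1 and verify that every replica reports `2/2` and `Running`. The following is a hypothetical sketch (the `all_ready` function is not part of the product) that performs that check on `oc get pod --no-headers` output:

```shell
# Hypothetical check: read `oc get pod --no-headers` output on stdin and
# exit 0 only if every replica is fully ready (2/2) and Running.
# Usage:
#   oc get pod -l ibm-es-server=aiops-ibm-elasticsearch-es-server \
#     -n "${AIOPS_PROJECT}" --no-headers | all_ready && echo "all replicas ready"
all_ready() {
  # Column 2 is READY, column 3 is STATUS in `oc get pod` output.
  awk '$2 != "2/2" || $3 != "Running" { bad = 1 } END { exit bad }'
}
```

If the check fails some minutes after the restart, recheck the pod list and logs as described in the steps above.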