Troubleshooting datastores

Review frequently encountered issues with the datastores that are used by IBM Cloud Pak® for AIOps.

Elasticsearch

Elasticsearch CA renewal rollout becomes stuck and causes the Policies page to not display policies

Elasticsearch replicas enter a crashing state when the Certificate authority (CA) renewal rollout begins. This is caused by trust issues between Elasticsearch pods that have different versions of the CA.

  1. Run the following command to check your Elasticsearch cluster for this state:

    export AIOPS_PROJECT=<project>
    oc get pod -l ibm-es-server=aiops-ibm-elasticsearch-es-server -n "${AIOPS_PROJECT}"
    

    Where <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed, usually cp4aiops.

    Example output where the replica aiops-ibm-elasticsearch-es-server-all-2 is crashing:

    aiops-ibm-elasticsearch-es-server-all-0   2/2   Running            0               12h   10.42.12.9    worker-1.acme.com  <none>  <none>
    aiops-ibm-elasticsearch-es-server-all-1   2/2   Running            0               12h   10.42.5.35    worker-2.acme.com  <none>  <none>
    aiops-ibm-elasticsearch-es-server-all-2   1/2   CrashLoopBackOff   59 (2m4s ago)   10h   10.42.11.36   worker-3.acme.com  <none>  <none>
    

    If one or more replicas have a state of CrashLoopBackOff then proceed to the next steps to confirm the cause.

  2. Run the following command to check whether the crashing pod's logs indicate that the crashing pod failed to join the Elasticsearch cluster:

    oc logs <crashing_pod> -n "${AIOPS_PROJECT}" -c elasticsearch | grep "cluster-manager not discovered"
    

    Where <crashing_pod> is the name of the pod that crashed. For example, aiops-ibm-elasticsearch-es-server-all-2

    Example output where the crashing pod failed to join the Elasticsearch cluster:

    [2025-01-02T23:18:14,045][WARN ][o.o.c.c.ClusterFormationFailureHelper] [aiops-ibm-elasticsearch-es-server-all-2] cluster-manager not discovered or elected yet, an election requires at least 2 nodes with ids from [ZtUu55QzRBq7u7LIMgvTEA, 8H_HGLquRZSy3Qqw1TtgeA, ot0uVCmTRr-l8sa6M4oaDg], have discovered [{aiops-ibm-elasticsearch-es-server-all-2}{ot0uVCmTRr-l8sa6M4oaDg}{j3V12hDBSAK9LhjNEbOGgg}{127.0.0.1}{127.0.0.1:9702}{dim}{shard_indexing_pressure_enabled=true}] which is not a quorum; discovery will continue using [127.0.0.1:9700, 127.0.0.1:9701] from hosts providers and [{aiops-ibm-elasticsearch-es-server-all-2}{ot0uVCmTRr-l8sa6M4oaDg}{j3V12hDBSAK9LhjNEbOGgg}{127.0.0.1}{127.0.0.1:9702}{dim}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 16, last-accepted version 33 in term 16
    

    If your crashing pod's logs have output similar to the preceding example output, then proceed to the next step.

  3. Run the following command to check whether the Elasticsearch HAProxy logs indicate SSL handshake failures:

    oc logs <crashing_pod> -n "${AIOPS_PROJECT}" -c haproxy | grep "SSL handshake failure"
    

    Where <crashing_pod> is the name of the pod that crashed. For example, aiops-ibm-elasticsearch-es-server-all-2

    Example output:

    127.0.0.1:53756 [20/Dec/2024:18:42:28.873] ft_elastic_transport_local_all_2 bk_elastic_transport_all_2/<NOSRV> -1/-1/0 0 SC 3/1/0/0/0 0/0
    10.42.4.57:51244 [20/Dec/2024:18:42:28.994] ft_elastic_http/1: SSL handshake failure
    10.42.5.35:39470 [20/Dec/2024:18:42:29.124] ft_elastic_transport/1: SSL handshake failure
    10.42.4.57:51250 [20/Dec/2024:18:42:29.201] ft_elastic_http/1: SSL handshake failure
    

    If your logs have output similar to the preceding example output, then use the following solution to rectify this issue.

Solution: Run the following command to restart all three Elasticsearch pods at the same time, to help ensure that they all mount the new CA certificate.

oc delete pod -l ibm-es-server=aiops-ibm-elasticsearch-es-server -n "${AIOPS_PROJECT}"

Example output:

pod "aiops-ibm-elasticsearch-es-server-all-0" deleted
pod "aiops-ibm-elasticsearch-es-server-all-1" deleted
pod "aiops-ibm-elasticsearch-es-server-all-2" deleted

Your elasticsearch replicas will restart and mount the new CA certificate.