Troubleshooting clusters and nodes

Learn how to isolate and resolve problems that involve cluster and node failures.

Temporal grouping policies not created after training historic events

Groups of related alerts that occur together (temporal groups) are not correlated automatically by the system. This issue can occur when temporal grouping policies are not created automatically due to communication issues between the aiops-ir-core-archiving pod and some Kafka topics.

Solution: Complete the following steps:

  1. Run the following command, while logged in as an admin user in the project (namespace) where IBM Cloud Pak® for AIOps is running:

    kubectl logs $(kubectl get pods|grep aiops-ir-core-archiving|grep -v setup|awk '{print $1}') | grep "lib-rdkafka status"| tail -1
    

    Example command output:

    {
    "name": "client.kafka",
    "hostname": "aiops-ir-core-archiving-5dfd8fdcc5-l52g8",
    "pid": 20,
    "level": 30,
    "brokerStates": {
       "0": "UP",
       "-1": "UP"
    },
    "partitionStates": {
       "cp4waiops-cartridge.irdatalayer.replay.alerts.0": "UP",
       "cp4waiops-cartridge.irdatalayer.replay.alerts.1": "UP",
       "cp4waiops-cartridge.irdatalayer.replay.alerts.2": "UP",
       "cp4waiops-cartridge.irdatalayer.replay.alerts.3": "UP",
       "cp4waiops-cartridge.irdatalayer.replay.alerts.4": "UP",
       "cp4waiops-cartridge.irdatalayer.replay.alerts.5": "UP"
    },
    "msg": "lib-rdkafka status",
    "time": "2021-10-19T11:00:49.852Z",
    "v": 0
    }
    
  2. If any of the broker or partition states are not UP, restart the aiops-ir-core-archiving pod with the following command:

    kubectl delete pod $(kubectl get pods|grep aiops-ir-core-archiving|grep -v setup|awk '{print $1}')
    
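To quickly identify any broker or partition that is not UP, you can filter the status line from step 1 with jq. The following is a minimal sketch, assuming that jq is installed and that the status line is a single JSON object as in the preceding example output; an empty result ({}) means that all states are UP.

# List only the broker and partition states that are not "UP" (empty output means all are healthy).
kubectl logs $(kubectl get pods|grep aiops-ir-core-archiving|grep -v setup|awk '{print $1}') | grep "lib-rdkafka status" | tail -1 | jq '.brokerStates + .partitionStates | with_entries(select(.value != "UP"))'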

Illegal reflective access operation in Jobmanager pod log

The Jobmanager pod log reports an illegal reflective access operation, as in the following example:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.shaded.akka.org.jboss.netty.util.internal.ByteBufferUtil (file:/opt/flink/lib/flink-dist_2.11-1.13.2.jar) to method java.nio.DirectByteBuffer.cleaner()
WARNING: Please consider reporting this to the maintainers of org.apache.flink.shaded.akka.org.jboss.netty.util.internal.ByteBufferUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

This warning is reported when the Flink library runs on Java 11. For more information, see https://issues.apache.org/jira/browse/FLINK-17524.

Solution: No action required. This warning message can be ignored and does not impact functionality.

Kafka pods fail to restart

Kafka pods fail to start when a worker node goes down, as in the following example:

evtmanager-kafka-0                                                   1/2     Running            55         15h
evtmanager-kafka-1                                                   2/2     Running            0          16h
evtmanager-kafka-2                                                   1/2     Running            0          15h
evtmanager-kafka-3                                                   1/2     Running            0          15h
evtmanager-kafka-4                                                   1/2     Running            0          15h
evtmanager-kafka-5                                                   2/2     Running            0          15h
evtmanager-zookeeper-0                                               1/1     Running            0          16h
evtmanager-zookeeper-1                                               1/1     Running            0          15h
evtmanager-zookeeper-2                                               1/1     Running            0          15h

If the worker node where the Kafka pods are located goes down, the Kafka pods might fail to restart on another node.

Solution: Restart all ZooKeeper and Kafka pods, listing every ZooKeeper and Kafka pod in your deployment:

oc delete pod evtmanager-zookeeper-0 evtmanager-zookeeper-1 ... evtmanager-zookeeper-n evtmanager-kafka-0 evtmanager-kafka-1 ... evtmanager-kafka-n
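
Alternatively, you can delete all of the ZooKeeper and Kafka pods in one command by filtering on the pod name. The following is a minimal sketch, assuming that your release name is evtmanager as in the preceding example output; adjust the pattern to match your pod names.

# Delete every pod whose name matches the evtmanager ZooKeeper or Kafka naming pattern.
oc delete pod $(oc get pods --no-headers | awk '/evtmanager-(zookeeper|kafka)-/ {print $1}')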

Cloud Pak for AIOps console not accessible after shutdown of backup cluster

The IBM Cloud Pak for AIOps console on the backup cluster cannot be accessed after an orderly shutdown and startup of that cluster. The following error message is displayed:

"Document was encrypted with unknown key '<namespace>-<release name>-<timestamp>'"

Solution: Find and then delete the required database through the API, as described in the following steps. A consolidated sketch follows the procedure.

  1. First, run the following command to retrieve the hostname:

    oc get route couchdb-georedundancy
    

    For this scenario, use the hostname couchdb-georedundancy-geonoi.apps.geo01primary.myibm.com.

  2. Run the following two commands to retrieve the username and password that are needed to query the database names.

    For Username:

    SECRET_NAME=$(oc get secret | grep couchdb-secret | awk '{ print $1 }'); kubectl get secret ${SECRET_NAME} -o json | grep "username" | cut -d : -f2 | cut -d '"' -f2 | base64 -d
    username
    

    For Password:

    SECRET_NAME=$(oc get secret | grep couchdb-secret | awk '{ print $1 }'); kubectl get secret ${SECRET_NAME} -o json | grep "password" | cut -d : -f2 | cut -d '"' -f2 | base64 -d
    password
    
  3. Use the hostname, username, and password obtained in the previous steps to run the command to retrieve the database names:

    curl -sk -u "username:password" https://couchdb-georedundancy-geonoi.apps.geo01primary.myibm.com/_all_dbs | jq "."
    

    Example output:

    [
    "_global_changes",
    "_replicator",
    "_users",
    "collabopsuser",
    "emailrecipients",
    "genericproperties",
    "icp-63666439356237652d336263372d343030362d613461382d613733613739633731323535-rba-as",
    "icp-63666439356237652d336263372d343030362d613461382d613733613739633731323535-rba-rbs",
    "integration",
    "noi-drdb",
    "noi-osregdb",
    "normalizercfd95b7e-3bc7-4006-a4a8-a73a79c71255",
    "osb-map",
    "otc_omaas_broker",
    "rba-pdoc",
    "schedule",
    "tenant",
    "trainingjob"
    ]
    

    The relevant database name here is collabopsuser.

  4. Run the following command to delete the database:

    curl -sk -u "root:netcool" -X DELETE "https://couchdb-georedundancy-geonoi.apps.geo01primary.myibm.com/${DATABASE_NAME}"
    

    Where ${DATABASE_NAME} is the database that you identified in step 3, in this example collabopsuser.

  5. Restart the cem-users pod to recreate the database:

    oc get pods |grep cem-users
    <cem_users_pod_name>
    
    oc delete pod <cem_users_pod_name>
    
  6. Confirm that the database is re-created by running the curl command from step 3:

    curl -sk -u "username:password" https://couchdb-georedundancy-geonoi.apps.geo01primary.myibm.com/_all_dbs | jq "."
    
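The following is a minimal sketch that consolidates steps 1 to 4, assuming that jq is installed and that the route, secret, and database names match the preceding examples. It uses the retrieved CouchDB credentials for the DELETE request, whereas the documented example in step 4 uses the CouchDB admin credentials; verify each value before you run the DELETE request.

# Retrieve the CouchDB route hostname and credentials.
COUCH_HOST=$(oc get route couchdb-georedundancy -o jsonpath='{.spec.host}')
SECRET_NAME=$(oc get secret | grep couchdb-secret | awk '{print $1}')
COUCH_USER=$(oc get secret "${SECRET_NAME}" -o jsonpath='{.data.username}' | base64 -d)
COUCH_PASS=$(oc get secret "${SECRET_NAME}" -o jsonpath='{.data.password}' | base64 -d)

# List the databases and confirm that the stale database (collabopsuser in this example) is present.
curl -sk -u "${COUCH_USER}:${COUCH_PASS}" "https://${COUCH_HOST}/_all_dbs" | jq "."

# Delete the database identified in step 3.
DATABASE_NAME=collabopsuser
curl -sk -u "${COUCH_USER}:${COUCH_PASS}" -X DELETE "https://${COUCH_HOST}/${DATABASE_NAME}"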

LAD Kafka intermittent issue: log anomaly detector abruptly stops making predictions

The log anomaly detector abruptly stops making predictions.

When you check the anomaly pod logs, you see error messages related to Kafka. Because of these intermittent issues, log anomaly detection stops consuming data from the windowed logs topic, even though the data preprocessing component continues to generate windows.

An example of such an error in the logs is as follows:

[2022-03-19 14:28:34,233] [logprophet.logs.watson_aiops] [ERROR] An error has occurred while processing message in batch mode. Error: 'NoneType' object is not iterable
[2022-03-19 14:28:34,234] [app.anomaly_detector.service_controller] [ERROR] Consumer failed with unhandled exception : 'NoneType' object is not iterable

Solution: Complete the following two-step process to resume the flow:

  1. Clear the windowed logs topic content. This step removes any old data that is not yet consumed by the log anomaly detector. If you do not clear this data, then when LAD is ready to make predictions it tries to generate events for the old data, which appears as a lag. Disable the data flow and clear the data from the windowed logs topic:

    • Disable the data flow from the data integration.

    • Clear the windowed logs topic by editing the Kafka topic with the following command (a non-interactive sketch follows these steps):

      oc edit kt cp4waiops-cartridge-windowed-logs-1000-1000
      
    • Set retention.ms to 1.

    • Wait for a few minutes. When the data is cleared and no more windows are generated, reset retention.ms to its original value.

    • Enable the data flow in the data integration.

  2. After the data is cleared by using the preceding steps, restart the anomaly pod with the following commands:

    oc get po | grep anomaly
    oc delete pod <anomaly-pod>
    
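As an alternative to interactively editing the topic in step 1, you can set and reset retention.ms with oc patch. The following is a minimal sketch, assuming that the KafkaTopic custom resource stores topic settings under spec.config and that the original retention value is 3600000 ms; substitute your topic name and the value that you noted before the change.

# Temporarily set retention.ms to 1 to purge the windowed logs topic.
# Assumption: the KafkaTopic resource exposes topic settings under spec.config.
oc patch kt cp4waiops-cartridge-windowed-logs-1000-1000 --type merge -p '{"spec":{"config":{"retention.ms":1}}}'

# Wait a few minutes for the old data to be removed, then restore the original value.
oc patch kt cp4waiops-cartridge-windowed-logs-1000-1000 --type merge -p '{"spec":{"config":{"retention.ms":3600000}}}'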

EDB Postgres cluster failure

To check whether the EDB Postgres cluster has failed, run the following command on your cluster:

oc get clusters.postgresql.k8s.enterprisedb.io <installation_name>-edb-postgres -n <namespace>

Where:

  • <installation_name> is the name of your IBM Cloud Pak for AIOps instance, for example ibm-cp-aiops.
  • <namespace> is the namespace that IBM Cloud Pak for AIOps is deployed in.

If the cluster STATUS does not show Cluster in healthy state and the READY count is 0, then the EDB Postgres cluster has failed.
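
You can read the same information directly from the status fields. The following is a minimal sketch, assuming that the Cluster resource exposes status.phase and status.readyInstances (the fields behind the STATUS and READY columns).

# Print the cluster phase and the number of ready instances.
# Assumption: status.phase and status.readyInstances exist on this Cluster resource.
oc get clusters.postgresql.k8s.enterprisedb.io <installation_name>-edb-postgres -n <namespace> -o jsonpath='{.status.phase}{" ready="}{.status.readyInstances}{"\n"}'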

EDB Postgres is a component that is used by IBM Cloud Pak for AIOps. A defect in EDB Postgres 1.18.6 and lower can cause the EDB Postgres cluster to fail. If the pod that hosts the primary instance of EDB Postgres is deleted and there is not a healthy EDB Postgres replica instance available to promote, then the EDB Postgres operator may not reschedule the pod hosting the primary instance.

Solution: If you encounter this problem, contact IBM Support, who can help you to identify and bring up the primary EDB Postgres node.

Postgres EDB database crashes

After a failure of one of the Postgres EDB replicas, a new replica is unable to come up, and the failed replica remains.

Solution: Delete the failed replica and its persistent volume claim (PVC) if the replica failure is due to a synchronization problem.

  1. Run the following command to find all Postgres instances that are not running:

    export PROJECT_CP4AIOPS=<project>
    oc get pod -n $PROJECT_CP4AIOPS -l "k8s.enterprisedb.io/podRole=instance" | grep -vE "Running" 
    

    Where <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed, usually cp4aiops.

    Example output where zen-metastore-edb-1 has failed:

    oc get pod -n cp4aiops -l "k8s.enterprisedb.io/podRole=instance" | grep -vE "Running" 
    
    NAMESPACE  NAME                  READY   STATUS             RESTARTS        AGE 
    cp4aiops   zen-metastore-edb-1   0/1     CrashLoopBackOff   904 (36s ago)   7d1h
    

  2. Examine the logs for the failed replica to check whether it is out of sync with the WAL logs on the primary replica.

    oc logs -n "${PROJECT_CP4AIOPS}" <failed_replica>
    

    Where <failed_replica> is the failed replica returned in step 1, for example zen-metastore-edb-1.

    Look for entries similar to the following examples:

    pg_rewind: servers diverged at WAL location 5/6606A4E0 on timeline 1
    pg_rewind: error: could not open file \"/var/lib/postgresql/data/pgdata/pg_wal/000000010000000500000068\": No such file or directory
    pg_rewind: fatal: could not read WAL record at 5/68000028
    

    {"level":"info","ts":1677595436.8679705,"logger":"postgres","msg":"record","logging_pod":"ibm-cp-aiops-edb-postgres-1","record":{"log_time":"2023-02-28 14:43:56.867 UTC","user_name":"streaming_replica","process_id":"90988","connection_from":"10.254.24.22:32852","session_id":"63fe132c.1636c","session_line_num":"1","command_tag":"idle","session_start_time":"2023-02-28 14:43:56 UTC","virtual_transaction_id":"13/0","transaction_id":"0","error_severity":"ERROR","sql_state_code":"58P01","message":"requested WAL segment 0000000300000002000000A0 has already been removed","query":"START_REPLICATION 2/A0000000 TIMELINE 2","application_name":"ibm-cp-aiops-edb-postgres-2","backend_type":"walsender"}}
    

  3. Check that the failed replica is not the primary replica.

    Run the following command, and check that the failed replica identified in step 1 is not listed in the PRIMARY column for any of the Postgres clusters.

    oc get clusters.postgresql.k8s.enterprisedb.io -n $PROJECT_CP4AIOPS
    

    Example output:

    oc get clusters.postgresql.k8s.enterprisedb.io -n cp4aiops
    NAME                        AGE   INSTANCES   READY   STATUS                     PRIMARY
    common-service-db           23h   1           1       Cluster in healthy state   common-service-db-1
    ibm-cp-aiops-edb-postgres   23h   1           1       Cluster in healthy state   ibm-cp-aiops-edb-postgres-1
    zen-metastore-edb           23h   2           2       Cluster in healthy state   zen-metastore-edb-2
    

  4. Delete the failed replica and its PVC.

    Run the following command if the failed replica is not the primary and the logs for the failed replica showed errors similar to the example output in step 2. Otherwise, contact IBM Support.

    oc delete pod,pvc <failed_replica> -n "${PROJECT_CP4AIOPS}"
    

    Where <failed_replica> is the failed replica returned in step 1, for example zen-metastore-edb-1.
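
The following minimal sketch combines the checks in steps 3 and 4, assuming that the Cluster resource reports the current primary instance in status.currentPrimary (shown in the PRIMARY column); if the failed replica is a primary, contact IBM Support instead of deleting it.

# Delete the failed replica only if it is not the primary of any Postgres cluster.
# Assumption: status.currentPrimary holds the primary instance name.
FAILED_REPLICA=zen-metastore-edb-1   # replace with the failed replica from step 1
if oc get clusters.postgresql.k8s.enterprisedb.io -n "${PROJECT_CP4AIOPS}" -o jsonpath='{.items[*].status.currentPrimary}' | grep -qw "${FAILED_REPLICA}"; then
  echo "${FAILED_REPLICA} is a primary instance: contact IBM Support instead of deleting it."
else
  oc delete pod,pvc "${FAILED_REPLICA}" -n "${PROJECT_CP4AIOPS}"
fi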

CouchDB pod reports a permission error when draining a node

The CouchDB pod's log reports a permission error similar to the following example:

[error] 2022-12-02T01:24:20.726024Z couchdb@c-example-couchdbcluster-m-0.c-example-couchdbcluster-m <0.13913.7> -------- Could not open file /data/db/shards/00000000-7fffffff/_users.1669801403.couch: permission denied

Note: After a restore into a new namespace, similar symptoms can be seen. For more information about the solution for this scenario, see Restore fails with CouchDB pod in CrashLoopBackOff.

The way that Kubernetes mounts persistent volumes can cause this problem.

Solution: Delete the CouchDB pods that are not ready. When the replacement pods are created, the persistent volumes are mounted correctly.

  1. Find the names of your CouchDB pods.

    oc get pods -l app.kubernetes.io/name=IssueResolutionCoreCouchDB -n <namespace>
    

    Where <namespace> is the namespace that IBM Cloud Pak for AIOps is deployed in.

  2. Delete all of the CouchDB pods that the previous step showed as not ready.

    oc delete pod <couchdb-pod> -n <namespace>
    

    Where:

    • <namespace> is the namespace that IBM Cloud Pak for AIOps is deployed in.
    • <couchdb-pod> is a couchdb pod that is not ready.
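
To delete only the pods that are not ready in a single step, you can filter on the READY column. The following is a minimal sketch, assuming single-container CouchDB pods (a ready pod reports 1/1); adjust the check if your pods run more containers.

# Delete every CouchDB pod whose READY column is not 1/1.
oc get pods -l app.kubernetes.io/name=IssueResolutionCoreCouchDB -n <namespace> --no-headers | awk '$2 != "1/1" {print $1}' | xargs -r oc delete pod -n <namespace>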

Kafka broker pods are unhealthy because Kafka PVCs are full

When the Kafka broker pods are unhealthy due to insufficient storage, the Events operator is unable to expand the PVCs.

Solution: Check whether the Kafka broker pods have storage available, and expand the storage available to Kafka if they do not.

  1. Check the state of the Kafka broker pods:

    export AIOPS_NAMESPACE=<project>
    oc get pod -n "${AIOPS_NAMESPACE}" -l app.kubernetes.io/name=kafka
    

    Where <project> is the namespace (project) that IBM Cloud Pak for AIOps is deployed in.

    Example output:

    iaf-system-kafka-0                                                1/1     Running                0                5d21h
    iaf-system-kafka-1                                                0/1     CreateContainerError   0                95m
    iaf-system-kafka-2                                                0/1     CreateContainerError   1 (2d18h ago)    5d21h 
    

  2. If the broker pods are in a CreateContainerError state, check the pod's events for errors:

    oc describe pod -n "${AIOPS_NAMESPACE}" iaf-system-kafka-1
    

    Example output:

    Warning  Failed          5m40s (x13 over 89m)  kubelet  Error: relabel failed /var/lib/kubelet/pods/37e6519a-b372-429d-b5f5-9aaf5b12fd37/volumes/kubernetes.io~csi/pvc-ff53e966-fee1-41e2-9448-b0eeac2e49c6/mount: lsetxattr /var/lib/kubelet/pods/37e6519a-b372-429d-b5f5-9aaf5b12fd37/volumes/kubernetes.io~csi/pvc-ff53e966-fee1-41e2-9448-b0eeac2e49c6/mount/kafka-log1/.lock: no space left on device
    

    Do not proceed with the rest of these troubleshooting steps if the error is not related to disk space.

  3. Adjust the Kafka storage size in the CustomResource:

    oc patch kafka/iaf-system -n "${AIOPS_NAMESPACE}" --type merge -p '{"spec":{"kafka":{"storage":{"size":"<size>Gi"}}}}' 
    

    Where <size> is the increased storage size that you require for the Kafka PVCs.

  4. Increase the Kafka PVCs:

    for p in $(oc get pvc -l app.kubernetes.io/name=kafka -n "${AIOPS_NAMESPACE}" -o jsonpath='{.items[*].metadata.name}'); do oc patch pvc $p -n "${AIOPS_NAMESPACE}" --type merge -p '{"spec":{"resources":{"requests":{"storage":"<size>Gi"}}}}' ; done
    

    Where <size> is the same value that was used in the previous command.

    Note: The time required for this command to run depends on the number of PVCs and the amount of data that is stored in each one.
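
To confirm that each PVC has been expanded, you can compare the requested size with the reported capacity. The following is a minimal sketch, assuming that your storage class supports online volume expansion; the two values match when the resize is complete.

# Compare the requested size with the actual capacity for each Kafka PVC.
for p in $(oc get pvc -l app.kubernetes.io/name=kafka -n "${AIOPS_NAMESPACE}" -o jsonpath='{.items[*].metadata.name}'); do oc get pvc $p -n "${AIOPS_NAMESPACE}" -o jsonpath='{.metadata.name}{": requested="}{.spec.resources.requests.storage}{" capacity="}{.status.capacity.storage}{"\n"}' ; done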

aimanager is not in a Ready state after a cluster restart, and integrations are missing from the console

After a cluster restart, aimanager does not have a Ready state when queried with oc get aimanager -o yaml, and some previously configured integrations, such as log integrations, ChatOps, runbooks, Kafka, or PagerDuty, are missing from the console. This problem can be caused by a timing issue where the aimanager pod attempts to start before the Postgres database is ready.

Solution: Complete the following steps.

  1. Run the following command to check if the aimanager-aio-controller pod is running.

    oc get pods | grep aimanager-aio-controller
    

    Example output for a pod that is running correctly:

    aimanager-aio-controller-67d6857cbb-n9skp     1/1     Running     0               7d3h
    

  2. If the aimanager-aio-controller is not running, then check the pod's logs for errors.

    oc logs <aimanager-aio-controller-pod-name>
    

    Where <aimanager-aio-controller-pod-name> is the name of the aimanager-aio-controller pod from step 1.

    Example error output:

    2024-12-03 04:51:14 ERROR ServletListener:801 - Failed to get edb connection from SQL query 
    java.sql.SQLException: Cannot create PoolableConnectionFactory (Connection to ibm-cp-aiops-edb-postgres-rw:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.)
        at org.apache.commons.dbcp2.BasicDataSource.createPoolableConnectionFactory(BasicDataSource.java:653) ~[commons-dbcp2-2.9.0.jar:2.9.0]
    

    If you have an error similar to the preceding example error output, then note down the name of the aimanager-aio-controller pod and use the following steps to resolve the problem.

  3. Check that the Postgres database is up and running, and do not continue until it is.

    oc get pods | grep edb-postgres
    

    Example output if PostgreSQL is running:

    aiops-installation-edb-postgres-1     1/1     Running     0               6d2h
    

  4. Run the following command to restart the aimanager-aio-controller pod.

    oc delete pod <aimanager-aio-controller-pod-name>
    

    Where <aimanager-aio-controller-pod-name> is the name of the aimanager-aio-controller pod from step 1.
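
The following minimal sketch automates steps 3 and 4, assuming that the EDB Postgres instance pods carry the k8s.enterprisedb.io/podRole=instance label that is used elsewhere in this topic; it waits until Postgres is ready before restarting the controller pod.

# Wait for the EDB Postgres instance pods to become Ready, then restart the aimanager-aio-controller pod.
oc wait pod -l k8s.enterprisedb.io/podRole=instance --for=condition=Ready --timeout=600s
oc delete pod $(oc get pods --no-headers | awk '/aimanager-aio-controller/ {print $1}')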

Deployment on Linux: unable to access Cloud Pak for AIOps console

The pods in the IBM Cloud Pak for AIOps cluster are in a Running state, but the user interface is not accessible.

Solution: When you installed IBM Cloud Pak for AIOps, you used an existing load balancer or configured a new one of your choice. For more information, see Load balancing. The load balancer is the entry point for accessing deployments of IBM Cloud Pak for AIOps on Linux. Problems with your load balancer can prevent access to the Cloud Pak for AIOps console. Check the load balancer status and logs for problems, and restart the load balancer if it is not running.

Deployment on Linux: API server is down after a cluster restart

After the Linux cluster is restarted, commands that call the Kubernetes API server, such as oc get pod and aiopsctl status, fail, as in the following example output:

$ oc get pod -n aiops
E0903 08:55:39.586644    1287 memcache.go:265] couldn't get current server API group list: the server is currently unable to handle the request
E0903 08:55:39.589138    1287 memcache.go:265] couldn't get current server API group list: the server is currently unable to handle the request
E0903 08:55:39.592133    1287 memcache.go:265] couldn't get current server API group list: the server is currently unable to handle the request
E0903 08:55:39.594610    1287 memcache.go:265] couldn't get current server API group list: the server is currently unable to handle the request
E0903 08:55:39.597172    1287 memcache.go:265] couldn't get current server API group list: the server is currently unable to handle the request
Error from server (ServiceUnavailable): the server is currently unable to handle the request 

$ aiopsctl status
o- [03 Sep 24 08:58 PDT] Getting cluster status
[ERROR] Failed to get cluster status apiserver not ready  

Solution: Complete the following steps to confirm that the problem is caused by the Kubernetes API server being down, and to rectify it.

  1. Check the control plane nodes.

    1. On each control plane node, run the following command to assess the platform state. Affected deployments have a state of activating.

      systemctl status k3s.service 
      

      Example output for a deployment affected by this issue:

      $ systemctl status k3s.service 
      
      k3s.service - Lightweight Kubernetes
         Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
         Active: activating (start) since Mon 2024-09-09 05:10:30 PDT; 6s ago
      

    2. On each control plane node, run the following command to check the platform logs:

      journalctl -xeu k3s.service 
      

      Example output for a deployment affected by this issue:

      Sep 09 05:11:28 control-plane-2.acme.com k3s[50701]: {"level":"warn","ts":"2024-09-09T05:11:28.735466-0700","caller":"rafthttp/http.go:413","msg":"failed to find remote peer in cluster","local-member-id":"8c10b42dfa5ffc54","rem>
      Sep 09 05:11:28 control-plane-2.acme.com k3s[50701]: {"level":"warn","ts":"2024-09-09T05:11:28.828015-0700","caller":"rafthttp/http.go:413","msg":"failed to find remote peer in cluster","local-member-id":"8c10b42dfa5ffc54","rem>
      Sep 09 05:11:28 control-plane-2.acme.com k3s[50701]: {"level":"warn","ts":"2024-09-09T05:11:28.834347-0700","caller":"rafthttp/http.go:145","msg":"failed to process Raft message","local-member-id":"8c10b42dfa5ffc54","error":"ra> 
      

    If your deployment is affected, continue with the following steps.

  2. Stop k3s on the primary control plane node:

    systemctl stop k3s
    

    Note: The primary control plane is the first control plane attached to the cluster. This is the value of CONTROL_PLANE_NODE in your aiops_var.sh environment variables file. The non-primary control plane nodes are the nodes in the ADDITIONAL_CONTROL_PLANE_NODES array in your aiops_var.sh environment variables file.

  3. Run a cluster reset on the primary control plane node to soft reset etcd.

    k3s server --cluster-reset
    
  4. Stop k3s on the non-primary control plane nodes.

    systemctl stop k3s
    
  5. Back up the etcd data directory on each non-primary control plane node.

    cp -r <data_directory> <backup_directory>
    

    Where:

    • <data_directory> is /var/lib/rancher/k3s/server/db if you are not using a custom platform directory, or PLATFORM_STORAGE_PATH in aiops_var.sh if you are using a custom platform directory.
    • <backup_directory> is a directory where you want to store the backup.
  6. Delete the etcd data directory on each non-primary control plane node.

    Important: Do not delete the platform data directory on the primary control plane node.

    rm -rf <data_directory>
    

    Where <data_directory> is /var/lib/rancher/k3s/server/db if you are not using a custom platform directory, or PLATFORM_STORAGE_PATH in aiops_var.sh if you are using a custom platform directory.

  7. Run the following command on the primary control plane node, and then on each of the non-primary control plane nodes:

    systemctl start k3s
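
The following minimal sketch covers steps 4 to 6 on the non-primary control plane nodes, assuming SSH access from the primary node and the ADDITIONAL_CONTROL_PLANE_NODES array from your aiops_var.sh file; adjust DATA_DIR if you use a custom platform directory (PLATFORM_STORAGE_PATH), and run it only after the cluster reset in step 3.

# Run from the primary control plane node after the cluster reset in step 3.
source ./aiops_var.sh   # assumption: this file defines the ADDITIONAL_CONTROL_PLANE_NODES array
DATA_DIR=/var/lib/rancher/k3s/server/db   # or your PLATFORM_STORAGE_PATH value

for node in "${ADDITIONAL_CONTROL_PLANE_NODES[@]}"; do
  # Stop k3s, back up the etcd data directory, then delete it on each non-primary node.
  ssh "${node}" "systemctl stop k3s && cp -r ${DATA_DIR} ${DATA_DIR}.bak && rm -rf ${DATA_DIR}"
done

# Then start k3s on the primary node first, and next on each non-primary node (step 7).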