Troubleshooting clusters and nodes
Learn how to isolate and resolve problems that involve cluster and node failures.
- Temporal grouping policies not created after training historic events
- Illegal reflective access operation in Jobmanager pod log
- Kafka pods fail to restart
- Cloud Pak for AIOps console not accessible after shutdown of backup cluster
- Log anomaly detection Kafka intermittent issue: log anomaly detector abruptly stops making predictions
- EDB Postgres cluster failure
- Postgres EDB database crashes
- CouchDB pod reports a permission error when draining a node
- Kafka broker pods are unhealthy because Kafka PVCs are full
- aimanager is not in Ready state after cluster restart, integrations are missing from the console
- Deployment on Linux: unable to access Cloud Pak for AIOps console
- Deployment on Linux: API server down after cluster restart
Temporal grouping policies not created after training historic events
Groups of related alerts that occur together (temporal groups) are not correlated automatically by the system. This issue can occur when temporal grouping policies are not created automatically due to communication issues between the aiops-ir-core-archiving pod and some Kafka topics.
Solution: Complete the following steps:
1. Run the following command while logged in as an admin user in the project (namespace) where IBM Cloud Pak® for AIOps is running:
kubectl logs $(kubectl get pods|grep aiops-ir-core-archiving|grep -v setup|awk '{print $1}') | grep "lib-rdkafka status"| tail -1
Example command output:
{ "name": "client.kafka", "hostname": "aiops-ir-core-archiving-5dfd8fdcc5-l52g8", "pid": 20, "level": 30, "brokerStates": { "0": "UP", "-1": "UP" }, "partitionStates": { "cp4waiops-cartridge.irdatalayer.replay.alerts.0": "UP", "cp4waiops-cartridge.irdatalayer.replay.alerts.1": "UP", "cp4waiops-cartridge.irdatalayer.replay.alerts.2": "UP", "cp4waiops-cartridge.irdatalayer.replay.alerts.3": "UP", "cp4waiops-cartridge.irdatalayer.replay.alerts.4": "UP", "cp4waiops-cartridge.irdatalayer.replay.alerts.5": "UP" }, "msg": "lib-rdkafka status", "time": "2021-10-19T11:00:49.852Z", "v": 0 }
2. If any of the broker or partition states are not UP, restart the aiops-ir-core-archiving pod with the following command:
kubectl delete pod $(kubectl get pods|grep aiops-ir-core-archiving|grep -v setup|awk '{print $1}')
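After the replacement pod is running, you can re-run the status check from step 1 to confirm that all broker and partition states report UP:
kubectl logs $(kubectl get pods|grep aiops-ir-core-archiving|grep -v setup|awk '{print $1}') | grep "lib-rdkafka status"| tail -1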
Illegal reflective access operation in Jobmanager pod log
The Jobmanager pod log reports an illegal reflective access operation, as in the following example:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.flink.shaded.akka.org.jboss.netty.util.internal.ByteBufferUtil (file:/opt/flink/lib/flink-dist_2.11-1.13.2.jar) to method java.nio.DirectByteBuffer.cleaner()
WARNING: Please consider reporting this to the maintainers of org.apache.flink.shaded.akka.org.jboss.netty.util.internal.ByteBufferUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
This warning is reported when the Flink library is run with Java 11. For more information, see https://issues.apache.org/jira/browse/FLINK-17524.
Solution: No action required. This warning message can be ignored and does not impact functionality.
Kafka pods fail to restart
Kafka pods fail to start when a worker node goes down, as in the following example:
evtmanager-kafka-0 1/2 Running 55 15h
evtmanager-kafka-1 2/2 Running 0 16h
evtmanager-kafka-2 1/2 Running 0 15h
evtmanager-kafka-3 1/2 Running 0 15h
evtmanager-kafka-4 1/2 Running 0 15h
evtmanager-kafka-5 2/2 Running 0 15h
evtmanager-zookeeper-0 1/1 Running 0 16h
evtmanager-zookeeper-1 1/1 Running 0 15h
evtmanager-zookeeper-2 1/1 Running 0 15h
If the worker node where the Kafka pods are located goes down, the Kafka pods might fail to restart on another node.
Solution: Restart all zookeeper and Kafka pods:
oc delete pod evtmanager-zookeeper-0 evtmanager-zookeeper-1 evtmanager-zookeeper-<n> evtmanager-kafka-0 evtmanager-kafka-1 evtmanager-kafka-<n>
Where <n> represents each remaining zookeeper and Kafka pod in your deployment.
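If you have many brokers, a shell loop can delete all of the zookeeper and Kafka pods in one pass. The following is a minimal sketch that assumes the evtmanager-* pod naming shown in the preceding example output; adjust the name pattern to match your own deployment:
# Delete every pod whose name starts with evtmanager-zookeeper- or evtmanager-kafka-
for p in $(oc get pods --no-headers | awk '/^evtmanager-(zookeeper|kafka)-/ {print $1}'); do
  oc delete pod "$p"
done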
Cloud Pak for AIOps console not accessible after shutdown of backup cluster
Unable to access the IBM Cloud Pak for AIOps console on the backup cluster after a systematic shut down and start up of the backup cluster. The following error message is displayed:
"Document was encrypted with unknown key '<namespace>-<release name>-<timestamp>'"
Solution: Find and then delete the required databases through the API.
1. Run the following command to retrieve the hostname:
oc get route couchdb-georedundancy
For this scenario, the hostname is couchdb-georedundancy-geonoi.apps.geo01primary.myibm.com.
2. Run the following two SECRET_NAME commands to retrieve the username and password that are required to retrieve the database names.
For the username:
SECRET_NAME=$(oc get secret | grep couchdb-secret | awk '{ print $1 }'); kubectl get secret ${SECRET_NAME} -o json | grep "username" | cut -d : -f2 | cut -d '"' -f2 | base64 -d
For the password:
SECRET_NAME=$(oc get secret | grep couchdb-secret | awk '{ print $1 }'); kubectl get secret ${SECRET_NAME} -o json | grep "password" | cut -d : -f2 | cut -d '"' -f2 | base64 -d
3. Use the hostname, username, and password that you obtained in the previous steps to run the following command to retrieve the database names:
curl -sk -u "username:password" https://couchdb-georedundancy-geonoi.apps.geo01primary.myibm.com/_all_dbs | jq "."
Example output:
[ "_global_changes", "_replicator", "_users", "collabopsuser", "emailrecipients", "genericproperties", "icp-63666439356237652d336263372d343030362d613461382d613733613739633731323535-rba-as", "icp-63666439356237652d336263372d343030362d613461382d613733613739633731323535-rba-rbs", "integration", "noi-drdb", "noi-osregdb", "normalizercfd95b7e-3bc7-4006-a4a8-a73a79c71255", "osb-map", "otc_omaas_broker", "rba-pdoc", "schedule", "tenant", "trainingjob" ]
The relevant database name here is collabopsuser.
4. Run the following command to delete the database:
curl -sk -u "root:netcool" -X DELETE "https://couchdb-georedundancy-geonoi.apps.geo01primary.myibm.com/${DATABASE_NAME}"
Where DATABASE_NAME is collabopsuser.
5. Restart the cem-users pod to re-create the database:
oc get pods | grep cem-users
oc delete pod <cem_users_pod_name>
Where <cem_users_pod_name> is the name of the cem-users pod that is returned by the first command.
6. Confirm that the database is re-created by running the curl command from step 3:
curl -sk -u "username:password" https://couchdb-georedundancy-geonoi.apps.geo01primary.myibm.com/_all_dbs | jq "."
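As an alternative to copying the credentials manually, you can capture them in shell variables and reuse them in the curl commands. The following is a minimal sketch that assumes the secret exposes username and password keys under .data, as shown in step 2:
# Decode the CouchDB credentials from the secret and reuse them
COUCH_USER=$(oc get secret ${SECRET_NAME} -o jsonpath='{.data.username}' | base64 -d)
COUCH_PASS=$(oc get secret ${SECRET_NAME} -o jsonpath='{.data.password}' | base64 -d)
curl -sk -u "${COUCH_USER}:${COUCH_PASS}" https://couchdb-georedundancy-geonoi.apps.geo01primary.myibm.com/_all_dbs | jq "."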
Log anomaly detection Kafka intermittent issue: log anomaly detector abruptly stops making predictions
The log anomaly detector abruptly stops making predictions.
When you check the log anomaly pod logs, you see Kafka-related error messages. Because of these intermittent issues, log anomaly detection stops consuming data from the windowed logs topic, even though the data preprocessing component continues to generate windows.
An example of such an error in the logs is as follows:
[2022-03-19 14:28:34,233] [logprophet.logs.watson_aiops] [ERROR] An error has occurred while processing message in batch mode. Error: 'NoneType' object is not iterable
[2022-03-19 14:28:34,234] [app.anomaly_detector.service_controller] [ERROR] Consumer failed with unhandled exception : 'NoneType' object is not iterable
Solution: Complete the following two-step process to resume the flow:
1. Clear the windowed logs topic content. This step clears any old data that the log anomaly detector has not yet consumed. If you do not clear this data, then when the detector is ready to make predictions again it tries to generate events for the old data, which shows as a lag. Disable the data flow and clear the windowed logs topic as follows:
a. Disable the data flow from the data integration.
b. Clear the windowed logs topic by using the following command:
oc edit kt cp4waiops-cartridge-windowed-logs-1000-1000
c. Set retention.ms to 1.
d. Wait for a few minutes. When the data is cleared and no more windows are generated, reset retention.ms to the original value. For a non-interactive alternative, see the example patch commands after these steps.
e. Enable the data flow in the data integration.
2. After the data is cleared by using the preceding steps, restart the log anomaly pod with the following commands:
oc get po | grep anomaly
oc delete pod <anomaly-pod>
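If you prefer not to edit the KafkaTopic resource interactively, you can set and then reset retention.ms with oc patch. The following is a minimal sketch that assumes the topic configuration is exposed under spec.config of the KafkaTopic custom resource, as it is for Strimzi-managed topics:
# Temporarily expire the windowed logs data
oc patch kafkatopic cp4waiops-cartridge-windowed-logs-1000-1000 --type merge -p '{"spec":{"config":{"retention.ms":1}}}'
# After the topic is cleared, restore the original retention value
oc patch kafkatopic cp4waiops-cartridge-windowed-logs-1000-1000 --type merge -p '{"spec":{"config":{"retention.ms":<original_retention_ms>}}}'
Where <original_retention_ms> is the retention.ms value that the topic had before you changed it.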
EDB Postgres cluster failure
Run the following command on your cluster:
oc get clusters.postgresql.k8s.enterprisedb.io <installation_name>-edb-postgres -n <namespace>
Where:
- <installation_name> is the name of your IBM Cloud Pak for AIOps instance, for example ibm-cp-aiops.
- <namespace> is the namespace that IBM Cloud Pak for AIOps is deployed in.
If the cluster STATUS does not show Cluster in healthy state and the READY count is 0, then the EDB Postgres cluster has failed.
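To print just the ready instance count and status phase, you can query the cluster resource directly. The following is a minimal sketch that assumes the standard EDB Postgres cluster status fields readyInstances and phase:
# Print the number of ready instances followed by the cluster phase
oc get clusters.postgresql.k8s.enterprisedb.io <installation_name>-edb-postgres -n <namespace> -o jsonpath='{.status.readyInstances} {.status.phase}{"\n"}'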
EDB Postgres is a component that is used by IBM Cloud Pak for AIOps. A defect in EDB Postgres 1.18.6 and lower can cause the EDB Postgres cluster to fail. If the pod that hosts the primary instance of EDB Postgres is deleted and there is not a healthy EDB Postgres replica instance available to promote, then the EDB Postgres operator may not reschedule the pod hosting the primary instance.
Solution: If you encounter this problem, then you must contact IBM Support who can help you to determine and bring up the primary EDB Postgres node.
Postgres EDB database crashes
After a failure of one of the Postgres EDB replicas, a new replica is unable to come up, and the failed replica remains.
Solution: Delete the failed replica and its persistent volume claim (PVC) if the replica failure is due to a synchronization problem.
1. Run the following commands to find all Postgres instances that are not running:
export PROJECT_CP4AIOPS=<project>
oc get pod -n $PROJECT_CP4AIOPS -l "k8s.enterprisedb.io/podRole=instance" | grep -vE "Running"
Where <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed, usually cp4aiops.
Example output where zen-metastore-edb-1 has failed:
NAMESPACE   NAME                  READY   STATUS             RESTARTS        AGE
aiops       zen-metastore-edb-1   0/1     CrashLoopBackOff   904 (36s ago)   7d1h
2. Examine the logs for the failed replica to see whether the non-primary replicas are not synchronized with the WAL logs in the primary replica:
oc logs -n "${PROJECT_CP4AIOPS}" <failed_replica>
Where <failed_replica> is the failed replica that is returned in step 1, for example zen-metastore-edb-1.
Look for entries similar to the following examples:
pg_rewind: servers diverged at WAL location 5/6606A4E0 on timeline 1
pg_rewind: error: could not open file \"/var/lib/postgresql/data/pgdata/pg_wal/000000010000000500000068\": No such file or directory
pg_rewind: fatal: could not read WAL record at 5/68000028
{"level":"info","ts":1677595436.8679705,"logger":"postgres","msg":"record","logging_pod":"ibm-cp-aiops-edb-postgres-1","record":{"log_time":"2023-02-28 14:43:56.867 UTC","user_name":"streaming_replica","process_id":"90988","connection_from":"10.254.24.22:32852","session_id":"63fe132c.1636c","session_line_num":"1","command_tag":"idle","session_start_time":"2023-02-28 14:43:56 UTC","virtual_transaction_id":"13/0","transaction_id":"0","error_severity":"ERROR","sql_state_code":"58P01","message":"requested WAL segment 0000000300000002000000A0 has already been removed","query":"START_REPLICATION 2/A0000000 TIMELINE 2","application_name":"ibm-cp-aiops-edb-postgres-2","backend_type":"walsender"}}
3. Check that the failed replica is not the primary replica.
Run the following command, and check that the failed replica that you identified in step 1 is not listed in the PRIMARY column for any of the Postgres clusters:
oc get clusters.postgresql.k8s.enterprisedb.io -n $PROJECT_CP4AIOPS
Example output:
NAME                        AGE   INSTANCES   READY   STATUS                     PRIMARY
common-service-db           23h   1           1       Cluster in healthy state   common-service-db-1
ibm-cp-aiops-edb-postgres   23h   1           1       Cluster in healthy state   ibm-cp-aiops-edb-postgres-1
zen-metastore-edb           23h   2           2       Cluster in healthy state   zen-metastore-edb-2
4. Delete the failed replica and its PVC.
Run the following command if the failed replica is not the primary and its logs showed errors similar to the example output in step 2. Otherwise, contact IBM Support.
oc delete pod,pvc <failed_replica> -n "${PROJECT_CP4AIOPS}"
Where <failed_replica> is the failed replica that is returned in step 1, for example zen-metastore-edb-1.
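After you delete the failed replica and its PVC, the EDB Postgres operator re-creates the instance. The following is a minimal verification sketch that reuses the label selector from step 1 to watch the replacement pod come up:
# Watch the Postgres instance pods until the replacement replica is Running
oc get pod -n "${PROJECT_CP4AIOPS}" -l "k8s.enterprisedb.io/podRole=instance" -w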
CouchDB pod reports a permission error when draining a node
The CouchDB pod's log reports a permission error similar to the following example:
[error] 2022-12-02T01:24:20.726024Z couchdb@c-example-couchdbcluster-m-0.c-example-couchdbcluster-m <0.13913.7> -------- Could not open file /data/db/shards/00000000-7fffffff/_users.1669801403.couch: permission denied
Note: After a restore into a new namespace, similar symptoms can be seen. For more information about the solution for this scenario, see Restore fails with CouchDB pod in CrashLoopBackOff.
The way that Kubernetes mounts persistent volumes can cause this problem.
Solution: Delete the CouchDB pods that are not ready. When the replacement pods are created, the persistent volumes are mounted correctly.
1. Find the names of your couchdb pods:
oc get pods -l app.kubernetes.io/name=IssueResolutionCoreCouchDB -n <namespace>
Where <namespace> is the namespace that IBM Cloud Pak for AIOps is deployed in.
2. Delete each couchdb pod that the previous step showed as not ready:
oc delete pod <couchdb-pod> -n <namespace>
Where:
- <namespace> is the namespace that IBM Cloud Pak for AIOps is deployed in.
- <couchdb-pod> is a couchdb pod that is not ready.
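If several couchdb pods are not ready, a shell loop can delete them all in one pass. The following is a minimal sketch that reuses the label selector from step 1 and assumes each pod reports a single container (READY 1/1 when healthy):
# Delete every CouchDB pod that is not fully ready
for p in $(oc get pods -l app.kubernetes.io/name=IssueResolutionCoreCouchDB -n <namespace> --no-headers | awk '$2 != "1/1" {print $1}'); do
  oc delete pod "$p" -n <namespace>
done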
Kafka broker pods are unhealthy because Kafka PVCs are full
When the Kafka broker pods are unhealthy due to insufficient storage, the Events operator is unable to expand the PVCs.
Solution: Check whether the Kafka broker pods have storage available, and expand the storage available to Kafka if they do not.
1. Check the state of the Kafka broker pods:
export AIOPS_NAMESPACE=<project>
oc get pod -n "${AIOPS_NAMESPACE}" -l app.kubernetes.io/name=kafka
Where <project> is the namespace (project) that IBM Cloud Pak for AIOps is deployed in.
Example output:
iaf-system-kafka-0   1/1   Running                0               5d21h
iaf-system-kafka-1   0/1   CreateContainerError   0               95m
iaf-system-kafka-2   0/1   CreateContainerError   1 (2d18h ago)   5d21h
2. If any broker pods are in a CreateContainerError state, check the pod events and logs for errors. The disk space error is typically reported in the pod events, which you can view with oc describe:
oc describe pod -n "${AIOPS_NAMESPACE}" iaf-system-kafka-1
oc logs -n "${AIOPS_NAMESPACE}" iaf-system-kafka-1
Example output:
Warning Failed 5m40s (x13 over 89m) kubelet Error: relabel failed /var/lib/kubelet/pods/37e6519a-b372-429d-b5f5-9aaf5b12fd37/volumes/kubernetes.io~csi/pvc-ff53e966-fee1-41e2-9448-b0eeac2e49c6/mount: lsetxattr /var/lib/kubelet/pods/37e6519a-b372-429d-b5f5-9aaf5b12fd37/volumes/kubernetes.io~csi/pvc-ff53e966-fee1-41e2-9448-b0eeac2e49c6/mount/kafka-log1/.lock: no space left on device
Do not proceed with the rest of these troubleshooting steps if the error is not related to disk space.
3. Adjust the Kafka storage size in the custom resource:
oc patch kafka/iaf-system -n "${AIOPS_NAMESPACE}" --type merge -p '{"spec":{"kafka":{"storage":{"size":"<size>Gi"}}}}'
Where <size> is the increased storage size that you require for the Kafka PVCs.
4. Increase the Kafka PVCs:
for p in $(oc get pvc -l app.kubernetes.io/name=kafka -n "${AIOPS_NAMESPACE}" -o jsonpath='{.items[*].metadata.name}'); do oc patch pvc $p -n "${AIOPS_NAMESPACE}" --type merge -p '{"spec":{"resources":{"requests":{"storage":"<size>Gi"}}}}' ; done
Where <size> is the same value that was used in the previous command.
Note: The time required for this command to run depends on the number of PVCs and the amount of data that is stored in each one.
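After the patches are applied, you can confirm that each Kafka PVC reports the new capacity. The following is a minimal verification sketch that reuses the same label selector:
# Show the name and current capacity of each Kafka PVC
oc get pvc -l app.kubernetes.io/name=kafka -n "${AIOPS_NAMESPACE}" -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage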
aimanager is not in Ready state after cluster restart, integrations are missing from the console
After a cluster restart, aimanager does not have a Ready state when queried with oc get aimanager -o yaml, and some previously configured integrations, such as log integrations, ChatOps, runbooks, Kafka, or PagerDuty, are missing from the console. This problem can be caused by a timing issue where the aimanager pod attempts to start before the Postgres database is ready.
Solution: Complete the following steps:
1. Run the following command to check whether the aimanager-aio-controller pod is running:
oc get pods | grep aimanager-aio-controller
Example output for a pod that is running correctly:
aimanager-aio-controller-67d6857cbb-n9skp 1/1 Running 0 7d3h
2. If the aimanager-aio-controller pod is not running, check its logs for errors:
oc logs <aimanager-aio-controller-pod-name>
Where <aimanager-aio-controller-pod-name> is the name of the aimanager-aio-controller pod from step 1.
Example error output:
2024-12-03 04:51:14 ERROR ServletListener:801 - Failed to get edb connection from SQL query
java.sql.SQLException: Cannot create PoolableConnectionFactory (Connection to ibm-cp-aiops-edb-postgres-rw:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.)
    at org.apache.commons.dbcp2.BasicDataSource.createPoolableConnectionFactory(BasicDataSource.java:653) ~[commons-dbcp2-2.9.0.jar:2.9.0]
If you see an error similar to the preceding example, note down the name of the aimanager-aio-controller pod and complete the following steps to resolve the problem.
3. Check that the Postgres database is up and running, and do not continue until it is:
oc get pods | grep edb-postgres
Example output if PostgreSQL is running:
aiops-installation-edb-postgres-1 1/1 Running 0 6d2h
4. Run the following command to restart the aimanager-aio-controller pod:
oc delete pod <aimanager-aio-controller-pod-name>
Where <aimanager-aio-controller-pod-name> is the name of the aimanager-aio-controller pod from step 1.
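To avoid restarting the controller before the database is ready, you can wait for the Postgres pods to become Ready first. The following is a minimal sketch that reuses the EDB Postgres instance label from the Postgres EDB database crashes section:
# Wait for the Postgres instance pods to be Ready, then restart the controller
oc wait pod -l "k8s.enterprisedb.io/podRole=instance" --for=condition=Ready --timeout=300s
oc delete pod <aimanager-aio-controller-pod-name>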
Deployment on Linux: unable to access Cloud Pak for AIOps console
The pods in the IBM Cloud Pak for AIOps cluster are in a Running state, but the user interface is not accessible.
Solution: When you installed IBM Cloud Pak for AIOps, you used an existing load balancer or configured a new one of your choice. For more information, see Load balancing. The load balancer is the entry point for accessing deployments of IBM Cloud Pak for AIOps on Linux. Problems with your load balancer can prevent access to the Cloud Pak for AIOps console. Check the load balancer status and logs for problems, and restart the load balancer if it is not running.
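For example, if you configured HAProxy as your load balancer (one of the options described in Load balancing), a minimal check and restart sketch might look like the following; substitute the service name of your own load balancer:
# Check the load balancer status and recent log entries
systemctl status haproxy
journalctl -u haproxy --since "1 hour ago"
# Restart the load balancer if it is not running
systemctl restart haproxy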
Deployment on Linux: API server is down after a cluster restart
After restarting the Linux cluster, all requests to the Kubernetes API server, such as oc get pod and aiopsctl status, fail, as in the following example output:
$ oc get pod -n aiops
E0903 08:55:39.586644 1287 memcache.go:265] couldn't get current server API group list: the server is currently unable to handle the request
E0903 08:55:39.589138 1287 memcache.go:265] couldn't get current server API group list: the server is currently unable to handle the request
E0903 08:55:39.592133 1287 memcache.go:265] couldn't get current server API group list: the server is currently unable to handle the request
E0903 08:55:39.594610 1287 memcache.go:265] couldn't get current server API group list: the server is currently unable to handle the request
E0903 08:55:39.597172 1287 memcache.go:265] couldn't get current server API group list: the server is currently unable to handle the request
Error from server (ServiceUnavailable): the server is currently unable to handle the request
$ aiopsctl status
o- [03 Sep 24 08:58 PDT] Getting cluster status
[ERROR] Failed to get cluster status apiserver not ready
Solution: Complete the following steps to confirm that the problem is caused by the Kubernetes API server being down, and to rectify it.
1. Check the control plane nodes.
a. On each control plane node, run the following command to assess the platform state. Affected deployments have a state of activating.
systemctl status k3s.service
Example output for a deployment affected by this issue:
$ systemctl status k3s.service
k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
     Active: activating (start) since Mon 2024-09-09 05:10:30 PDT; 6s ago
b. On each control plane node, run the following command to check the platform logs:
journalctl -xeu k3s.service
Example output for a deployment affected by this issue:
Sep 09 05:11:28 control-plane-2.acme.com k3s[50701]: {"level":"warn","ts":"2024-09-09T05:11:28.735466-0700","caller":"rafthttp/http.go:413","msg":"failed to find remote peer in cluster","local-member-id":"8c10b42dfa5ffc54","rem>
Sep 09 05:11:28 control-plane-2.acme.com k3s[50701]: {"level":"warn","ts":"2024-09-09T05:11:28.828015-0700","caller":"rafthttp/http.go:413","msg":"failed to find remote peer in cluster","local-member-id":"8c10b42dfa5ffc54","rem>
Sep 09 05:11:28 control-plane-2.acme.com k3s[50701]: {"level":"warn","ts":"2024-09-09T05:11:28.834347-0700","caller":"rafthttp/http.go:145","msg":"failed to process Raft message","local-member-id":"8c10b42dfa5ffc54","error":"ra>
If your deployment is affected, continue with the following steps.
2. Stop k3s on the primary control plane node:
systemctl stop k3s
Note: The primary control plane is the first control plane attached to the cluster. This is the value of CONTROL_PLANE_NODE in your aiops_var.sh environment variables file. The non-primary control plane nodes are the nodes in the ADDITIONAL_CONTROL_PLANE_NODES array in your aiops_var.sh environment variables file.
3. Run a cluster reset on the primary control plane node to soft reset etcd:
k3s server --cluster-reset
4. Stop k3s on the non-primary control plane nodes:
systemctl stop k3s
5. Back up the etcd data directory on each non-primary control plane node:
cp -r <data_directory> <backup_directory>
Where:
- <data_directory> is /var/lib/rancher/k3s/server/db if you are not using a custom platform directory, or PLATFORM_STORAGE_PATH in aiops_var.sh if you are using a custom platform directory.
- <backup_directory> is a directory where you want to store the backup.
6. Delete the etcd data directory on each non-primary control plane node.
Important: Do not delete the platform data directory on the primary control plane node.
rm -rf <data_directory>
Where <data_directory> is /var/lib/rancher/k3s/server/db if you are not using a custom platform directory, or PLATFORM_STORAGE_PATH in aiops_var.sh if you are using a custom platform directory.
7. Run the following command on the primary control plane node, and then on each of the non-primary control plane nodes:
systemctl start k3s
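After k3s is started on every control plane node, you can confirm that the API server is reachable again. The following is a minimal verification sketch:
# Confirm that the platform service is active and that the API server answers requests
systemctl status k3s.service
oc get nodes
aiopsctl status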