Troubleshooting installation and upgrade
Learn how to isolate and resolve problems with installing and upgrading IBM Cloud Pak for AIOps.
Known issues and limitations
In addition to the common problems and solutions that are listed, review the installation-related known issues for potential issues that you might encounter during installation. For more information, see Known issues and limitations.
Troubleshooting tips
- General troubleshooting tips (Console)
- General troubleshooting tips (CLI)
- Troubleshooting storage
- Recovering IBM Cloud Pak for AIOps after a cluster restart
Note: If you have an aborted or stalled installation, uninstall any installed or partially installed components and restart the installation. For more information, see Uninstalling IBM Cloud Pak for AIOps.
Installation issues
- Installation fails to proceed when creating the IBM Cloud Pak for AIOps custom resource
- The AI manager operator does not progress past Reconciling, and the Cassandra setup pod is stuck
- "Application is not available" message is displayed intermittently
- Installation stalls with aiops-ir-analytics-cassandra-setup job in CrashLoopBackOff
- Installation on Linux only: fails with network connectivity errors when using aiopsctl to install on vSphere VMs
- Installation on Linux only: deployment is unstable after cluster restart
- Installation on Linux only: unable to log in to IBM Cloud Pak for AIOps
- Pods stuck after running aiopsctl node up, with error: "failed to pull image" (deployments on Linux only)
Upgrade issues
Issues common to installation and upgrade
General troubleshooting
General troubleshooting tips (Console)
1. From the Red Hat® OpenShift® console, go to Workloads > Pods.

2. Select the project that IBM Cloud Pak for AIOps is installed in from the drop-down menu to view the status of all the pods in this project.

3. If you want to see the status and details for one of the pods, click the pod that you want to investigate.

4. If you want to see the logs for the pod, click Logs.
General troubleshooting tips (CLI)
1. To find out the status of all the pods in a project, run the following command:

oc get pods -n <project>

Where <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.

2. To see the status of a particular pod, run the following command:

oc get pods | grep <pod_name>

Where <pod_name> is the pod that you want to see status for.

3. To see a pod's logs, run the following command:

oc logs -f <pod_name> -n <project>

Where <pod_name> is the pod that you want to see logs for, and <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.
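To narrow the output to pods that are not healthy, you can filter on the pod phase. The following command is a convenience filter, not part of the documented procedure; note that completed job pods report a phase of Succeeded, so this filter hides them along with running pods:

oc get pods -n <project> --field-selector=status.phase!=Running,status.phase!=Succeeded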
Troubleshooting storage
Storage that is associated with persistent volume claims (PVCs) reaches capacity. This issue can occur as storage requests and resources fluctuate over time.
Solution: If storage fills up, increase the size of the PVC. For more information about increasing the size of a PVC, see Storage class requirements.
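For example, if your storage class supports volume expansion (allowVolumeExpansion: true), you can grow a PVC in place by patching its storage request. The following sketch uses a hypothetical PVC name and target size; check your storage class first:

oc get sc <storage_class_name> -o jsonpath='{.allowVolumeExpansion}'

oc patch pvc <pvc_name> -n <project> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

Where <pvc_name> is the PVC that is reaching capacity, and 20Gi is an example size that must be larger than the current size.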
Recovering IBM Cloud Pak for AIOps after a cluster restart
After a Red Hat® OpenShift® Container Platform cluster restart, your IBM Cloud Pak for AIOps deployment has not come up successfully.
Use the following steps to help you to identify any problems.
1. Run the following command, and then check that all of the OpenShift nodes came back up and that they all report a STATUS of Ready:

oc get nodes

If any of the OpenShift nodes did not come back up, see the Updating clusters and Troubleshooting sections in the OpenShift documentation for guidance.

2. Run the following command, and then check that the IBM Cloud Pak for AIOps pods all have a STATUS of Running or Completed:

oc get pods -n <project>

Where <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.

3. Check the logs of any pods that did not report a STATUS of Running or Completed in step 2 to see whether you can identify a cause:

oc logs -f <failed_pod_name> -n <project>

Where <failed_pod_name> is a failed pod name, as returned in step 2, and <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.

Solution: Restart any failed pods with the following command:

oc delete pod <failed_pod_name> -n <project>

Where <failed_pod_name> is the pod name as returned in step 2, and <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.
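If multiple pods failed, you can restart them in one pass. The following one-liner is a convenience sketch; it assumes that the stuck pods report a phase of Failed (for example, pods that were evicted during the restart). Pods in states such as CrashLoopBackOff report a Running phase and must be deleted by name as shown above:

oc get pods -n <project> --field-selector=status.phase=Failed -o name | xargs -r oc delete -n <project>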
Troubleshooting installation
Installation not proceeding after creating the IBM Cloud Pak for AIOps custom resource
When you are installing, the installation does not proceed as expected after you create an instance of the IBM Cloud Pak for AIOps custom resource. For instance, after you create the instance, the installation process appears to be stalled, but no pods in the namespace display errors.
This stalling of the installation process can be due to an issue with the Operator Lifecycle Manager (OLM), which causes the expected ClusterServiceVersion (CSV) for some operator subscriptions to not be found. For example, installPlan instances
for the aimanager-operator, connector-utilities-operator, and other operators can be found, but without a reference to the required ClusterServiceVersion (CSV) YAML. While the operators can be created within your
defined namespace, if the ClusterServiceVersion (CSV) cannot be found, the operators cannot be managed. If this issue occurs, the overall installation cannot proceed correctly.
Solution: Delete any operator subscription that does not have an associated ClusterServiceVersion (CSV) YAML and has an unresolved subscription. The installation process should re-create the associated operator and generate a new installPlan. If a new installPlan is not generated, repeat the process to delete the operator subscription and try again.
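To identify the subscriptions that are missing their CSV, you can list each subscription together with its resolved CSV; an empty CSV column indicates an unresolved subscription. The following commands are a sketch that uses standard OLM status fields; the subscription names in your cluster can differ:

oc get subscriptions.operators.coreos.com -n <project> -o custom-columns=NAME:.metadata.name,CSV:.status.currentCSV

oc delete subscriptions.operators.coreos.com <subscription_name> -n <project>

Where <subscription_name> is a subscription with an empty CSV column.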
The AI manager operator does not progress past Reconciling, and the Cassandra setup pod is stuck
When installing, the AI Manager operator remains in Phase: Reconciling. The aiops-ir-analytics-cassandra-setup pod is in CrashLoopBackOff state, and its log has the following error:
ERROR [2023-06-13 09:57:44,010] com.ibm.itsm.analytics.metric.service.cassandra.SessionManager: Cassandra not available.
Solution: Use the following steps to reset the Cassandra admin password, and delete the aiops-ir-analytics-cassandra-setup and aiops-ir-core-archiving-setup jobs.
1. Run the following command to switch to your IBM Cloud Pak for AIOps project:

oc project <project>

Where <project> is the project that your IBM Cloud Pak for AIOps installation is deployed in.

2. Reset the Cassandra admin password:

oc rsh aiops-topology-cassandra-0 bash -c 'python3 -u /opt/ibm/change-superuser.pyc --initial --user admin --password $(cat /opt/ibm/cassandra_auth/password) --replication-factor $CASSANDRA_AUTH_REPLICATION_FACTOR'

3. Delete the aiops-ir-analytics-cassandra-setup job, if it does not own any pods.

a. Check whether any pods belong to the aiops-ir-analytics-cassandra-setup job:

oc get pod -l app.kubernetes.io/managed-by=aiops-analytics-operator,app.kubernetes.io/component=cassandra-setup

b. If no pods are returned, delete the aiops-ir-analytics-cassandra-setup job:

oc delete job -l app.kubernetes.io/managed-by=aiops-analytics-operator,app.kubernetes.io/component=cassandra-setup

4. Delete the aiops-ir-core-archiving-setup job, if it does not own any pods.

a. Check whether any pods belong to the aiops-ir-core-archiving-setup job:

oc get pod -l app.kubernetes.io/managed-by=ir-core-operator,app.kubernetes.io/component=archiving-setup

b. If no pods are returned, delete the aiops-ir-core-archiving-setup job:

oc delete job -l app.kubernetes.io/managed-by=ir-core-operator,app.kubernetes.io/component=archiving-setup
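After the jobs are deleted, the operators re-create them on their next reconcile. As a verification step (assuming that the jobs come back with the same names), check that both setup jobs eventually report completions:

oc get job aiops-ir-analytics-cassandra-setup aiops-ir-core-archiving-setup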
"Application is not available" message is displayed intermittently.
This is caused by a defect in some versions of Red Hat OpenShift. For more information, see OpenShift 4 - Intermittent 503 from all routes or timeouts when calling to backends directly when allow-from-openshift-ingress network policy is applied in the Red Hat OpenShift documentation.
Solution: Upgrade to a supported version of Red Hat OpenShift. For more information, see Supported Red Hat OpenShift versions.
Installation on Linux fails with network connectivity errors when using aiopsctl to install on vSphere VMs
The aiopsctl script installs the Kubernetes distribution k3s. When k3s is installed on a VMware vSphere virtual machine (VM), routing across ClusterIPs does not work. The creation of the subscriptions for Cert Manager and the Licensing Service fails with errors similar to the following because they cannot reach the catalog:
- message: 'failed to populate resolver cache from source ibm-aiops-catalog/olm:
failed to list bundles: rpc error: code = Unavailable desc = connection error:
desc = "transport: Error while dialing: dial tcp: lookup ibm-aiops-catalog.olm.svc
on 10.43.0.10:53: read udp 10.42.1.2:41727->10.43.0.10:53: i/o timeout"'
This problem is caused by a known kernel issue with Red Hat® Enterprise Linux® on vSphere. To fix the problem, run the following command on each of your nodes:
ethtool -K flannel.1 tx-checksum-ip-generic off
For more information, see k3s support.
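Note that this ethtool setting does not persist across reboots. One possible way to reapply it automatically is a small systemd oneshot unit, as in the following sketch; the unit name and file path are illustrative and are not created by aiopsctl:

cat <<'EOF' | sudo tee /etc/systemd/system/flannel-tx-checksum-off.service
[Unit]
Description=Disable TX checksum offload on flannel.1 (RHEL on vSphere workaround)
# flannel.1 is created by k3s; on worker nodes use k3s-agent.service instead.
After=k3s.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now flannel-tx-checksum-off.service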
Installation on Linux: deployment is unstable after cluster restart
After a cluster restart, one of the IBM Cloud Pak for AIOps components, zookeeper, cannot start because of DNS resolution issues that cause it to misidentify its assigned cluster IP address.
Running the following command shows that iaf-system-kafka-0 and some other pods failed:
kubectl get pods
Running the following command shows errors in the zookeeper logs:
kubectl logs -l app.kubernetes.io/name=zookeeper --tail -1
Example errors:
2024-06-17 20:57:26,674 ERROR Couldn't bind to iaf-system-zookeeper-0.iaf-system-zookeeper-nodes.aiops.svc/10.42.1.53:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
java.net.BindException: Cannot assign requested address
java.net.SocketException: Unresolved address
Solution: Run the following command to restart the zookeeper pods, and then wait a few minutes for the system to recover:
kubectl delete pods -l app.kubernetes.io/name=zookeeper
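To confirm the recovery, you can watch the zookeeper pods return to Running, and then check that iaf-system-kafka-0 and the other failed pods come back as well:

kubectl get pods -l app.kubernetes.io/name=zookeeper -w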
Installation on Linux: unable to log in to IBM Cloud Pak for AIOps
If you have a deployment of IBM Cloud Pak for AIOps on Linux and your Domain Name System (DNS) does not provide resolution for the cp-console hostname on all of your nodes, then you might not be able to log in to the Cloud Pak for AIOps console.
Run the following command from the control plane node to view the platform-identity-provider pod logs:
oc logs deployment/platform-identity-provider -n aiops
Example output where cp-console is not resolved:
{"name":"platform-identity-provider","hostname":"platform-identity-provider-79855f988-zqp2z","pid":1,"level":50,"err":{"message":"getaddrinfo ENOTFOUND cp-console-aiops.example.com","name":"Error","stack":"Error: getaddrinfo ENOTFOUND cp-console-aiops.example.com\n at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:107:26)","code":"ENOTFOUND"},"msg":"getaddrinfo ENOTFOUND cp-console-aiops.example.com","time":"2025-03-24T20:10:31.696Z","v":0}
Solution: Use one of the following options to resolve this issue:
Option 1: Configure DNS records to resolve the hosts for accessing the Cloud Pak for AIOps console, and ensure that the DNS is used by all of the nodes in the cluster. For more information, see DNS requirements.
Option 2: Configure CoreDNS to provide DNS resolution within the cluster. Use the following steps:
1. Run the following command to create a ConfigMap with a custom configuration for the CoreDNS server:

oc apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  default.server: |
    cp-console-aiops.<example.com> {
      hosts {
        <load_balancer_ip> cp-console-aiops.<example.com>
        fallthrough
      }
    }
EOF

Where:
<example.com> is your hostname.
<load_balancer_ip> is the IP address of your load balancer.

2. Run the following command to apply the configuration:

oc -n kube-system rollout restart deployment coredns
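To verify in-cluster name resolution after the restart, you can run a throwaway pod that looks up the cp-console hostname. This check assumes that your cluster can pull the public busybox image; substitute any image that includes nslookup:

oc run dns-check --rm -it --restart=Never --image=busybox:1.36 -- nslookup cp-console-aiops.<example.com>

The lookup should return the <load_balancer_ip> address that you configured in the ConfigMap.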
Pods stuck after running aiopsctl node up, with error: "failed to pull image" (deployments on Linux only)
If you are using a proxy and pods are stuck in ContainerCreating or ImagePullBackOff after running the aiopsctl node up command, you might have missing or incorrect proxy settings on one or more nodes that prevent successful image pulls.
1. Run the following command to view errors for the pod that is stuck:

oc describe pod <pod-name> -n <namespace>

Example output where image pulls fail because of missing or incorrect proxy configuration:

FailedCreatePodSandBox 1m kubelet Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = failed to pull image ... dial tcp <registry-ip>:443: i/o timeout

2. Verify whether your system environment configuration settings are set up correctly for your proxy.

Run the following command on each of your control plane nodes:

cat /etc/systemd/system/k3s.service.env

Run the following command on each of your worker nodes:

cat /etc/systemd/system/k3s-agent.service.env

Example output:

HTTP_PROXY=<Your proxy>
HTTPS_PROXY=<Your proxy>
NO_PROXY=127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 # sample list of IPs

Check that the NO_PROXY variable includes the IP address ranges for the public and private IPs of the cluster nodes.

3. If the proxy variables are missing or incorrect, correct them and apply your changes with the following steps.

a. Add or update the variables in the appropriate .env file (/etc/systemd/system/k3s.service.env or /etc/systemd/system/k3s-agent.service.env).

b. Restart the k3s service.

Run the following command on each of the control plane nodes:

sudo systemctl restart k3s

Run the following command on each of the worker nodes:

sudo systemctl restart k3s-agent

c. Validate connectivity and image pulls.

Run the following command to manually pull an image and confirm network access:

crictl pull $(cat /etc/rancher/k3s/config.yaml | grep "pause-image:" | sed 's/pause-image: //')

If the image pull succeeds, then the proxy configuration is now correct.
Installation stalls with aiops-ir-analytics-cassandra-setup job in CrashLoopBackOff
Installation stalls because the aiops-ir-analytics-cassandra-setup pod is in CrashLoopBackOff state, and its log has an error similar to the following:
ERROR [2025-05-05 08:54:52,906] com.ibm.itsm.topology.service.app.BaseServiceApp: java.lang.IllegalStateException: Schema is not in agreement, wait and retry
Solution: Restart the Cassandra StatefulSets.
1. Run the following commands to export an environment variable for your IBM Cloud Pak for AIOps project and switch to it:

export AIOPS_NAMESPACE=<project>
oc project <project>

Where <project> is the project that your IBM Cloud Pak for AIOps installation is deployed in.

2. Make a note of the number of replicas for Cassandra:

oc get statefulset aiops-topology-cassandra -n ${AIOPS_NAMESPACE}

Example output for 3 Cassandra StatefulSet replicas:

NAME                       READY   AGE
aiops-topology-cassandra   0/3     43m

3. Scale down the Cassandra StatefulSet replicas to 0:

oc scale statefulsets aiops-topology-cassandra --replicas=0 -n ${AIOPS_NAMESPACE}

4. Manually scale up the first Cassandra StatefulSet replica:

oc scale statefulsets aiops-topology-cassandra --replicas=1 -n ${AIOPS_NAMESPACE}

Wait until the node is ready.

5. Rerun step 4 with an incremented number for replicas, and wait until the node is ready. For example:

oc scale statefulsets aiops-topology-cassandra --replicas=2 -n ${AIOPS_NAMESPACE}

Repeat this step until you have the number of Cassandra StatefulSet replicas that you noted in step 2.
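If you have several replicas, steps 4 and 5 can be scripted. The following loop is a minimal sketch that assumes the 3 replicas from the example in step 2; oc rollout status blocks until the currently requested replicas are ready:

for i in 1 2 3; do
  oc scale statefulsets aiops-topology-cassandra --replicas=$i -n ${AIOPS_NAMESPACE}
  # Wait for replica $i to become ready before adding the next one.
  oc rollout status statefulset aiops-topology-cassandra -n ${AIOPS_NAMESPACE} --timeout=15m
done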
Troubleshooting upgrade
After upgrade, policies do not process alerts due to problem with the aiops-ir-lifecycle-create-policies-job
This issue can occur if there is a problem connecting to Cassandra during upgrade. The aiops-ir-lifecycle-create-policies-job reports successful completion even if it was unable to connect to Cassandra and run some of its policy-related queries. After upgrade, policies do not process alerts, and error messages similar to the following are present in the aiops-ir-lifecycle-create-policies-job's log:
"name":"installdefaultpolicies","hostname":"aiops-ir-lifecycle-create-policies-job-c8k7t","pid":42,"level":40,"err":{"message":"All host(s) tried for query failed.
"name":"installdefaultpolicies","hostname":"aiops-ir-lifecycle-create-policies-job-c8k7t","pid":42,"level":50,"msg":"Failed to update policy with content
Solution: Restart the aiops-ir-lifecycle-create-policies-job by running the following command:
oc delete job aiops-ir-lifecycle-create-policies-job
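The lifecycle operator re-creates the job after you delete it. As a verification step (assuming that the job comes back with the same name), check that the new run completes without the Cassandra connection errors:

oc get job aiops-ir-lifecycle-create-policies-job
oc logs job/aiops-ir-lifecycle-create-policies-job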
After upgrade, alerts are not visible in the UI
1. Run the following command to view the LifecycleTrigger job:

oc get LifecycleTrigger aiops -o yaml

2. Check the status section of the returned YAML.

Example status section:

status:
  conditions:
  - lastTransitionTime: "2024-04-15T10:12:44Z"
    message: Job '5e569b68014d9a669574096001ce51e6' in state 'RESTARTING'
    observedGeneration: 2
    reason: NotReady
    status: "False"
    type: LifecycleTriggerReady
  jobs:
  - jid: 5e569b68014d9a669574096001ce51e6
    startTime: "2024-04-13T20:22:17Z"
    state: RESTARTING
    savepoints:
    - path: s3://aiops-ir-lifecycle/savepoints/savepoint-d9c1a5-95d27a6574b4
      time: "2024-04-13T20:10:56Z"
    - lifecycleVersion: 4.5.1
      namespace: aiops
      path: s3://aiops-ir-lifecycle/savepoints/savepoint-d9c1a5-40e97c1c5bdd
      time: "2024-04-13T20:15:22Z"
    - lifecycleVersion: 4.5.1
      namespace: aiops
      path: s3://aiops-ir-lifecycle/savepoints/savepoint-d9c1a5-40e97c1c5bdd-upgraded
      time: "2024-04-13T20:21:53Z"
    - lifecycleVersion: 4.5.1
      namespace: aiops
      path: s3://aiops-ir-lifecycle/savepoints/savepoint-5e569b-6d60d2f35ca8
      time: "2024-04-13T20:23:20Z"
Check to see if:

- The conditions section has Job <...> in state 'RESTARTING'.
- The savepoints section has an entry with a path that ends with -upgraded, which follows an entry with the same path without this suffix.
- The two entries that are described above are followed by at least one more entry. This check is important.
Solution: If all of the above checks are true, then you can remediate this problem with the following steps.
1. Note the value of jid in the status block (5e569b68014d9a669574096001ce51e6 in the example above) and export it as an environment variable:

export JOB_ID=<jid>

2. Run the following command to cancel the LifecycleTrigger job that is restarting:

oc patch LifecycleTrigger aiops --type json --patch '[{"op":"add","path":"/spec/cancelJobs","value":["'${JOB_ID}'"]}]'

3. Check the LifecycleTrigger YAML again:

oc get LifecycleTrigger aiops -o yaml

After about a minute, the status.jobs section should have two entries:

- The first entry has the old job ID, and its state changes to CANCELLED.
- The second entry has a new job ID, and its state should eventually settle at RUNNING.
Issues that are common to installation and upgrade
Offline install or upgrade throws 'invalid character' error
When doing an offline install or upgrade, running the oc ibm-pak generate mirror-manifests <..> command throws an error similar to the following:
Error: failed to load the catalog FBC at cp.stg.icr.io/cp/ <...> invalid character '<' in string escape code
Solution: You must have IBM Catalog Management Plug-in for IBM Cloud Pak (ibm-pak-plugin) v1.10 or higher installed. Run the following commands to ensure that ibm-pak-plugin is at the
required level.
1. Check which version of ibm-pak-plugin you have installed.

Run the following command on your bastion host, portable compute device, or connected compute device if you are using a portable storage device:

oc ibm-pak --version

Example output:

v1.11.0

2. If the ibm-pak-plugin version is lower than v1.10.0, then you must download and install the most recent version. Follow the steps for your installation approach:

- Bastion host: Install the IBM Catalog Management Plug-in for IBM Cloud Pak®.
- Portable device: Install the IBM Catalog Management Plug-in for IBM Cloud Pak®.
Offline install or upgrade stuck because the 'oc ibm-pak generate mirror-manifests' command fails with 'no space left on device'
The oc ibm-pak generate mirror-manifests $CASE_NAME $TARGET_REGISTRY --version $CASE_VERSION command fails with a message similar to the following in $IBMPAK_HOME/.ibm-pak/logs/oc-ibm_pak.log:
write /tmp/render-unpack-4002583241/var/lib/rpm/Packages: no space left on device
Solution: The default temporary directory does not have enough space to run the ibm-pak tool. You must set the TMPDIR environment variable to a different directory with more space before running the
oc ibm-pak generate mirror-manifests command.
TMPDIR=<new_temp_dir> oc ibm-pak generate mirror-manifests ${CASE_NAME} ${TARGET_REGISTRY} --version ${CASE_VERSION}
Where <new_temp_dir> is the path for a directory with more space.
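For example, the following sequence creates a temporary directory on a larger filesystem (the path is yours to choose) and confirms that it has more free space than the default before rerunning the command:

mkdir -p <new_temp_dir>
df -h /tmp <new_temp_dir>
TMPDIR=<new_temp_dir> oc ibm-pak generate mirror-manifests ${CASE_NAME} ${TARGET_REGISTRY} --version ${CASE_VERSION}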