Troubleshooting installation and upgrade
Learn how to isolate and resolve problems with installing and upgrading IBM Cloud Pak for AIOps.
Known issues and limitations
In addition to the common problems and solutions that are listed here, review the installation-related known issues that you might encounter during installation. For more information, see Known issues and limitations.
Troubleshooting tips
- General troubleshooting tips (Console)
- General troubleshooting tips (CLI)
- Troubleshooting storage
- Recovering IBM Cloud Pak for AIOps after a cluster restart
Note: If you have an aborted or stalled installation, uninstall any installed or partially installed components and restart the installation. For more information about uninstalling, see the following topics:
Installation issues
- Installation fails to proceed when creating the IBM Cloud Pak for AIOps custom resource
- The AI manager operator does not progress past Reconciling, and the Cassandra setup pod is stuck
- "Application is not available" message is displayed intermittently
- Installation on Linux fails with network connectivity errors when using aiopsctl to install on vSphere VMs
- Installation on Linux: deployment is unstable after cluster restart
- Installation on Linux: unable to log in to IBM Cloud Pak for AIOps
Upgrade issues
Issues common to installation and upgrade
General troubleshooting
General troubleshooting tips (Console)
1. From the Red Hat® OpenShift® console, go to Workloads > Pods.
2. Select the project that IBM Cloud Pak for AIOps is installed in from the drop-down menu to view the status of all the pods in this project.
3. To see the status and details for one of the pods, click the pod that you want to investigate.
4. To see the logs for the pod, click Logs.
General troubleshooting tips (CLI)
1. To find the status of all the pods in a project, run the following command:

   ```shell
   oc get pods -n <project>
   ```

   Where `<project>` is the project (namespace) where IBM Cloud Pak for AIOps is deployed.

2. To see the status of a particular pod, run the following command:

   ```shell
   oc get pods -n <project> | grep <pod_name>
   ```

   Where `<pod_name>` is the pod that you want to see the status for.

3. To see a pod's logs, run the following command:

   ```shell
   oc logs -f <pod_name> -n <project>
   ```

   Where `<pod_name>` is the pod whose logs you want to see, and `<project>` is the project (namespace) where IBM Cloud Pak for AIOps is deployed.
Troubleshooting storage
Storage that is associated with persistent volume claims (PVCs) can reach capacity. This issue can occur as storage requests and resource usage fluctuate over time.
Solution: If storage fills up, increase the size of the PVC. For more information about increasing the size of a PVC, see Storage class requirements.
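For example, assuming the storage class was created with `allowVolumeExpansion: true`, a PVC can be grown by raising its storage request. This is a sketch; the PVC name in the comment and the size are illustrative:

```yaml
# Edit with: oc edit pvc <pvc_name> -n <project>
spec:
  resources:
    requests:
      storage: 100Gi   # raise this value; Kubernetes does not support shrinking a PVC
```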
Recovering IBM Cloud Pak for AIOps after a cluster restart
After a Red Hat® OpenShift® Container Platform cluster restart, your IBM Cloud Pak for AIOps deployment has not come up successfully.
Use the following steps to help you to identify any problems.
1. Run the following command and check that all of the OpenShift nodes come back up, and that they all report a STATUS of `Ready`:

   ```shell
   oc get nodes
   ```

   If any of the OpenShift nodes have not come back up, see the Updating clusters and Troubleshooting sections in the OpenShift documentation for guidance.

2. Run the following command and check that the IBM Cloud Pak for AIOps pods all have a STATUS of `Running` or `Completed`:

   ```shell
   oc get pods -n <project>
   ```

   Where `<project>` is the project (namespace) where IBM Cloud Pak for AIOps is deployed.

3. Check the logs of any pods that did not report a STATUS of `Running` or `Completed` in the preceding step to see whether you can identify a cause:

   ```shell
   oc logs -f <failed_pod_name> -n <project>
   ```

   Where `<failed_pod_name>` is the name of a failed pod, as returned in the preceding step, and `<project>` is the project (namespace) where IBM Cloud Pak for AIOps is deployed.
Solution: Restart any failed pods with the following command:

```shell
oc delete pod <failed_pod_name> -n <project>
```

Where `<failed_pod_name>` is the name of a failed pod, as returned in the preceding steps, and `<project>` is the project (namespace) where IBM Cloud Pak for AIOps is deployed.
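The pod listing from `oc get pods -n <project>` can be long; a quick way to narrow it down to the pods that need attention is to filter out healthy states with grep. In this sketch the listing is inlined as sample data (the pod names are made up); against a live cluster you would pipe `oc get pods -n <project> --no-headers` into the same filter:

```shell
# Sample listing standing in for: oc get pods -n <project> --no-headers
pods='aimanager-operator-5f6d8b7c9d-abcde      1/1   Running            0   5m
aiops-ir-analytics-cassandra-setup-x2vqk   0/1   CrashLoopBackOff   4   5m
aiops-ir-core-archiving-setup-9kqz2        0/1   Completed          0   5m'

# Keep only the pods that are not Running or Completed
printf '%s\n' "$pods" | grep -Ev 'Running|Completed'
```

This prints only the `CrashLoopBackOff` line from the sample data, so only failing pods remain in view.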
Troubleshooting installation
Installation not proceeding after creating the IBM Cloud Pak for AIOps custom resource
When you are installing, the installation does not proceed as expected after you create an instance of the IBM Cloud Pak for AIOps custom resource. For instance, after you create the instance, the installation process appears to be stalled, but no pods in the namespace display errors.
This stalling of the installation process can be due to an issue with the Operator Lifecycle Manager (OLM), which causes the expected ClusterServiceVersion (CSV) for some operator subscriptions to not be found. For example, `installPlan` instances for the `aimanager-operator`, `connector-utilities-operator`, and other operators can be found, but without a reference to the required CSV YAML. While the operators can be created within your defined namespace, if the CSV cannot be found, the operators cannot be managed, and the overall installation cannot proceed correctly.
Solution: Delete any operator subscription that does not have an associated ClusterServiceVersion (CSV) YAML and has an unresolved subscription. The installation process should re-create the associated operator and generate a new installPlan. If a new installPlan is not generated, repeat the process to delete the operator subscription and try again.
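To spot the affected subscriptions, the installed CSV for each subscription can be listed in one pass; a `<none>` in the CSV column marks a candidate for deletion. The sketch below runs the filter over inlined sample data (the subscription names and versions are illustrative); on a live cluster you would feed it from the `oc` command shown in the comment:

```shell
# Sample output standing in for:
#   oc get subscriptions -n <project> --no-headers \
#     -o custom-columns=NAME:.metadata.name,CSV:.status.installedCSV
subs='aimanager-operator             aimanager-operator.v4.6.0
connector-utilities-operator   <none>
ir-core-operator               ir-core-operator.v4.6.0'

# Print only the subscriptions with no resolved CSV; these are the
# candidates for: oc delete subscription <subscription_name> -n <project>
printf '%s\n' "$subs" | awk '$2 == "<none>" { print $1 }'
```

In the sample data this prints only `connector-utilities-operator`, the subscription whose CSV was never resolved.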
The AI manager operator does not progress past Reconciling, and the Cassandra setup pod is stuck
When installing, the AI Manager operator remains in `Phase: Reconciling`. The `aiops-ir-analytics-cassandra-setup` pod is in a `CrashLoopBackOff` state, and its log contains the following error:

```
ERROR [2023-06-13 09:57:44,010] com.ibm.itsm.analytics.metric.service.cassandra.SessionManager: Cassandra not available.
```
Solution: Use the following steps to reset the Cassandra admin password, and delete the `aiops-ir-analytics-cassandra-setup` and `aiops-ir-core-archiving-setup` jobs.
1. Run the following command to switch to your IBM Cloud Pak for AIOps project:

   ```shell
   oc project <project>
   ```

   Where `<project>` is the project that your IBM Cloud Pak for AIOps installation is deployed in.

2. Reset the Cassandra admin password:

   ```shell
   oc rsh aiops-topology-cassandra-0 bash -c 'python3 -u /opt/ibm/change-superuser.pyc --initial --user admin --password $(cat /opt/ibm/cassandra_auth/password) --replication-factor $CASSANDRA_AUTH_REPLICATION_FACTOR'
   ```

3. Delete the `aiops-ir-analytics-cassandra-setup` job, if it does not own any pods.

   1. Check whether any pods belong to the `aiops-ir-analytics-cassandra-setup` job:

      ```shell
      oc get pod -l app.kubernetes.io/managed-by=aiops-analytics-operator,app.kubernetes.io/component=cassandra-setup
      ```

   2. If no pods are returned, delete the `aiops-ir-analytics-cassandra-setup` job:

      ```shell
      oc delete job -l app.kubernetes.io/managed-by=aiops-analytics-operator,app.kubernetes.io/component=cassandra-setup
      ```

4. Delete the `aiops-ir-core-archiving-setup` job, if it does not own any pods.

   1. Check whether any pods belong to the `aiops-ir-core-archiving-setup` job:

      ```shell
      oc get pod -l app.kubernetes.io/managed-by=ir-core-operator,app.kubernetes.io/component=archiving-setup
      ```

   2. If no pods are returned, delete the `aiops-ir-core-archiving-setup` job:

      ```shell
      oc delete job -l app.kubernetes.io/managed-by=ir-core-operator,app.kubernetes.io/component=archiving-setup
      ```
"Application is not available" message is displayed intermittently.
This is caused by a defect in some versions of Red Hat OpenShift. For more information, see OpenShift 4 - Intermittent 503 from all routes or timeouts when calling to backends directly when `allow-from-openshift-ingress` network policy is applied in the Red Hat OpenShift documentation.
Solution: Upgrade to a supported version of Red Hat OpenShift. For more information, see Supported Red Hat OpenShift versions.
Installation on Linux fails with network connectivity errors when using aiopsctl to install on vSphere VMs
The `aiopsctl` script installs the Kubernetes distribution `k3s`. When `k3s` is installed on a VMware vSphere virtual machine (VM), routing across ClusterIPs does not work. The creation of the subscriptions for Cert Manager and the Licensing Service fails with errors similar to the following because they cannot reach the catalog:

```
- message: 'failed to populate resolver cache from source ibm-aiops-catalog/olm:
  failed to list bundles: rpc error: code = Unavailable desc = connection error:
  desc = "transport: Error while dialing: dial tcp: lookup ibm-aiops-catalog.olm.svc
  on 10.43.0.10:53: read udp 10.42.1.2:41727->10.43.0.10:53: i/o timeout"'
```
This problem is caused by a known kernel issue with Red Hat® Enterprise Linux® on vSphere. To fix the problem, run the following command on each of your nodes:

```shell
ethtool -K flannel.1 tx-checksum-ip-generic off
```

For more information, see k3s support.
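Note that `ethtool` settings do not persist across reboots, so the workaround must be reapplied whenever a node restarts. One way to automate that is a small systemd unit on each node. This is a sketch, not part of the product; the unit name is made up, and it assumes the default `k3s.service` name and that the `flannel.1` interface appears shortly after `k3s` starts:

```ini
# /etc/systemd/system/flannel-tx-off.service (hypothetical unit name)
[Unit]
Description=Disable TX checksum offload on flannel.1 (k3s on vSphere workaround)
After=k3s.service
Wants=k3s.service

[Service]
Type=oneshot
# Wait briefly for k3s to create the flannel.1 interface
ExecStartPre=/bin/sh -c 'for i in $(seq 1 30); do ip link show flannel.1 >/dev/null 2>&1 && exit 0; sleep 2; done; exit 1'
ExecStart=/usr/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off

[Install]
WantedBy=multi-user.target
```

Enable it on each node with `systemctl enable --now flannel-tx-off.service`.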
Installation on Linux: deployment is unstable after cluster restart
After a cluster restart, one of the IBM Cloud Pak for AIOps components, `zookeeper`, cannot start because of DNS resolution issues that cause it to misidentify its assigned cluster IP address.
Running the following command shows that `iaf-system-kafka-0` and some other pods failed:

```shell
kubectl get pods
```

Running the following command shows errors in the `zookeeper` logs:

```shell
kubectl logs -l app.kubernetes.io/name=zookeeper --tail -1
```
Example errors:

```
2024-06-17 20:57:26,674 ERROR Couldn't bind to iaf-system-zookeeper-0.iaf-system-zookeeper-nodes.aiops.svc/10.42.1.53:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
java.net.BindException: Cannot assign requested address
java.net.SocketException: Unresolved address
```
Solution: Run the following command to restart the `zookeeper` pods, and then wait a few minutes for the system to recover:

```shell
kubectl delete pods -l app.kubernetes.io/name=zookeeper
```
Installation on Linux: unable to log in to IBM Cloud Pak for AIOps
If you have a deployment of IBM Cloud Pak for AIOps on Linux and your Domain Name System (DNS) does not provide resolution for the `cp-console` hostname on all of your nodes, then you might not be able to log in to the Cloud Pak for AIOps console.
Run the following command from the control plane node to view the `platform-identity-provider` pod logs:

```shell
oc logs deployment/platform-identity-provider -n aiops
```
Example output where `cp-console` is not resolved:

```
{"name":"platform-identity-provider","hostname":"platform-identity-provider-79855f988-zqp2z","pid":1,"level":50,"err":{"message":"getaddrinfo ENOTFOUND cp-console-aiops.example.com","name":"Error","stack":"Error: getaddrinfo ENOTFOUND cp-console-aiops.example.com\n at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:107:26)","code":"ENOTFOUND"},"msg":"getaddrinfo ENOTFOUND cp-console-aiops.example.com","time":"2025-03-24T20:10:31.696Z","v":0}
```
Solution: Use one of the following options to resolve this issue:
Option 1: Configure DNS records to resolve the hosts for accessing the Cloud Pak for AIOps console, and ensure that the DNS is used by all of the nodes in the cluster. For more information, see DNS requirements.
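For example, in BIND zone-file syntax, an A record for the console hostname would look like the following; the domain and IP address are placeholders, and the record must be served by the DNS that every node in the cluster uses:

```
; Hypothetical record; substitute your domain and load balancer IP
cp-console-aiops.example.com.    300  IN  A  203.0.113.10
```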
Option 2: Configure CoreDNS to provide DNS resolution within the cluster. Use the following steps:
1. Run the following command to create a ConfigMap with a custom configuration for the CoreDNS server:

   ```shell
   oc apply -f - <<EOF
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: coredns-custom
     namespace: kube-system
   data:
     default.server: |
       cp-console-aiops.<example.com> {
         hosts {
           <load_balancer_ip> cp-console-aiops.<example.com>
           fallthrough
         }
       }
   EOF
   ```

   Where:
   - `<example.com>` is your hostname
   - `<load_balancer_ip>` is the IP address of your load balancer
2. Run the following command to apply the configuration:

   ```shell
   oc -n kube-system rollout restart deployment coredns
   ```
Troubleshooting upgrade
After upgrade, policies do not process alerts due to a problem with the aiops-ir-lifecycle-create-policies-job
This issue can occur if there is a problem connecting to Cassandra during upgrade. The `aiops-ir-lifecycle-create-policies-job` reports successful completion even if it was unable to connect to Cassandra and run some of its policy-related queries. After upgrade, policies do not process alerts, and error messages similar to the following are present in the `aiops-ir-lifecycle-create-policies-job` log:
```
"name":"installdefaultpolicies","hostname":"aiops-ir-lifecycle-create-policies-job-c8k7t","pid":42,"level":40,"err":{"message":"All host(s) tried for query failed.
"name":"installdefaultpolicies","hostname":"aiops-ir-lifecycle-create-policies-job-c8k7t","pid":42,"level":50,"msg":"Failed to update policy with content
```
Solution: Restart the `aiops-ir-lifecycle-create-policies-job` by running the following command:

```shell
oc delete job aiops-ir-lifecycle-create-policies-job
```
After upgrade, alerts are not visible in the UI
1. Run the following command to view the LifecycleTrigger job:

   ```shell
   oc get LifecycleTrigger aiops -o yaml
   ```

2. Check the `status` section of the returned YAML.

   Example `status` section:

   ```yaml
   status:
     conditions:
     - lastTransitionTime: "2024-04-15T10:12:44Z"
       message: Job '5e569b68014d9a669574096001ce51e6' in state 'RESTARTING'
       observedGeneration: 2
       reason: NotReady
       status: "False"
       type: LifecycleTriggerReady
     jobs:
     - jid: 5e569b68014d9a669574096001ce51e6
       startTime: "2024-04-13T20:22:17Z"
       state: RESTARTING
     savepoints:
     - path: s3://aiops-ir-lifecycle/savepoints/savepoint-d9c1a5-95d27a6574b4
       time: "2024-04-13T20:10:56Z"
     - lifecycleVersion: 4.5.1
       namespace: aiops
       path: s3://aiops-ir-lifecycle/savepoints/savepoint-d9c1a5-40e97c1c5bdd
       time: "2024-04-13T20:15:22Z"
     - lifecycleVersion: 4.5.1
       namespace: aiops
       path: s3://aiops-ir-lifecycle/savepoints/savepoint-d9c1a5-40e97c1c5bdd-upgraded
       time: "2024-04-13T20:21:53Z"
     - lifecycleVersion: 4.5.1
       namespace: aiops
       path: s3://aiops-ir-lifecycle/savepoints/savepoint-5e569b-6d60d2f35ca8
       time: "2024-04-13T20:23:20Z"
   ```
Check whether:

- the `conditions` section has `Job <...> in state 'RESTARTING'`.
- the `savepoints` section has an entry with a path that ends with `-upgraded`, which follows an entry with the same path without this suffix.
- the two entries that are described above are followed by at least one more entry. This check is important.
Solution: If all of the above checks are true, then you can remediate this problem with the following steps.
1. Note the value of `jid` in the status block (`5e569b68014d9a669574096001ce51e6` in the example above) and export it as an environment variable:

   ```shell
   export JOB_ID=<jid>
   ```
2. Run the following command to cancel the LifecycleTrigger job that is restarting:

   ```shell
   oc patch LifecycleTrigger aiops --type json --patch '[{"op":"add","path":"/spec/cancelJobs","value":["'${JOB_ID}'"]}]'
   ```
3. Check the LifecycleTrigger YAML again:

   ```shell
   oc get LifecycleTrigger aiops -o yaml
   ```

   After about a minute, the `status.jobs` section should have two entries:

   - The first entry has the old job ID, and its state changes to CANCELLED.
   - The second entry has a new job ID, and its state should eventually settle at RUNNING.
Navigation menu items disappear after an upgrade
You might notice an issue where navigation menu items in the IBM Cloud Pak for AIOps console disappear when you upgrade.
To determine whether you are encountering this issue, run the following command:

```shell
kubectl logs -l component=zen-core-api
```

The output can resemble the following example:

```
time="2024-07-18 13:46:00" level=error msg=GetNavExtensions func=zen-core-api/source/apis/internal_extensions/nav.GetNavItemsWrapper file="/go/src/zen-core-api/source/apis/internal_extensions/nav/get_nav_items.go:124" exception="circular dependency or dangling node found in aiops-nav-menuitem-alerts_page,aiops-connectors-nav,aiops-nav-menuitem-metric_search,aiops-insights-nav,aiops-nav-menuitem-incidents_page,aiops-aimodel-nav,aiops-nav-menuitem-resource_management,aiops-nav-menuitem-automations"
```
Solution: If you see the preceding error, use the following step to restore the missing menu items:

1. Delete the `aiops-baseui-frontdoor-ext` resource:

   ```shell
   oc delete zenextension aiops-baseui-frontdoor-ext
   ```

   The output can resemble the following example:

   ```
   zenextension.zen.cpd.ibm.com "aiops-baseui-frontdoor-ext" deleted
   ```
Issues that are common to installation and upgrade
Offline install or upgrade throws 'invalid character' error
When doing an offline install or upgrade, running the `oc ibm-pak generate mirror-manifests <..>` command throws an error similar to the following:

```
Error: failed to load the catalog FBC at cp.stg.icr.io/cp/ <...> invalid character '<' in string escape code
```
Solution: You must have IBM Catalog Management Plug-in for IBM Cloud Pak (`ibm-pak-plugin`) v1.10 or later installed. Run the following commands to ensure that `ibm-pak-plugin` is at the required level.
1. Check which version of `ibm-pak-plugin` you have installed.

   Run the following command on your bastion host, or on your portable compute device or connected compute device if you are using a portable storage device:

   ```shell
   oc ibm-pak --version
   ```

   Example output:

   ```
   v1.11.0
   ```
2. If the `ibm-pak-plugin` version is lower than v1.10.0, then you must download and install the most recent version. Follow the steps for your installation approach:

   - Bastion host: Install the IBM Catalog Management Plug-in for IBM Cloud Pak®.
   - Portable device: Install the IBM Catalog Management Plug-in for IBM Cloud Pak®.
Offline install or upgrade stuck because the 'oc ibm-pak generate mirror-manifests' command fails with 'no space left on device'
The `oc ibm-pak generate mirror-manifests $CASE_NAME $TARGET_REGISTRY --version $CASE_VERSION` command fails with a message similar to the following in `$IBMPAK_HOME/.ibm-pak/logs/oc-ibm_pak.log`:

```
write /tmp/render-unpack-4002583241/var/lib/rpm/Packages: no space left on device
```
Solution: The default temporary directory does not have enough space to run the `ibm-pak` tool. You must set the `TMPDIR` environment variable to a directory with more space before running the `oc ibm-pak generate mirror-manifests` command:

```shell
TMPDIR=<new_temp_dir> oc ibm-pak generate mirror-manifests ${CASE_NAME} ${TARGET_REGISTRY} --version ${CASE_VERSION}
```

Where `<new_temp_dir>` is the path of a directory with more space.
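Before rerunning the mirror-manifests command, it can help to confirm up front that the chosen directory exists, has room, and is actually picked up through the standard `TMPDIR` convention (which `mktemp` also honors). The directory path below is an example:

```shell
# Choose a temp directory on a filesystem with enough free space (example path)
export TMPDIR=/var/tmp/ibm-pak-tmp
mkdir -p "$TMPDIR"

df -h "$TMPDIR"   # confirm available space before generating mirror manifests
mktemp -u         # prints a candidate path under $TMPDIR, confirming the variable is set
```

`mktemp -u` only prints a name without creating a file, so this check leaves no files behind.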