Troubleshooting installation and upgrade

Learn how to isolate and resolve problems with installing and upgrading IBM Cloud Pak for AIOps.

Known issues and limitations

In addition to the common problems and solutions that are listed, review the installation-related known issues for potential problems that you might encounter during installation. For more information, see Known issues and limitations.

Troubleshooting tips

Note: If you have an aborted or stalled installation, uninstall any installed or partially installed components and restart the installation. For more information, see the uninstallation documentation.

This information is organized into the following sections:

Installation issues

Upgrade issues

Issues common to installation and upgrade

General troubleshooting

General troubleshooting tips (Console)

  1. From the Red Hat® OpenShift® console, go to Workloads > Pods.

  2. Select the project that IBM Cloud Pak for AIOps is installed in from the drop-down menu to view the status of all the pods in this project.

  3. If you want to see the status and details for one of the pods, click the pod that you want to investigate.

  4. If you want to see the logs for the pod, click Logs.

General troubleshooting tips (CLI)

  • To find out the status of all the pods in a project, run the following command:

    oc get pods -n <project>
    

    Where <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.

  • To see the status of a particular pod, run the following command:

    oc get pods | grep <pod_name>
    

    Where <pod_name> is the pod that you want to see status for.

  • To see a pod's logs, run the following command:

    oc logs -f <pod_name> -n <project>
    

    Where

    • <pod_name> is the pod that you want to see logs for.
    • <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.
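The status and log checks above can be combined into a small filter that lists only the pods that need attention. The following function is a sketch; it assumes the default `oc get pods` column layout, where STATUS is the third column.

```shell
# Print pods whose STATUS is not Running or Completed.
# Assumes the default 'oc get pods' column layout (STATUS in column 3).
filter_unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" {print $1, $3}'
}

# Usage (replace <project> with your namespace):
# oc get pods -n <project> | filter_unhealthy_pods
```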

Troubleshooting storage

Storage that is associated with persistent volume claims (PVCs) reaches capacity. This issue can occur as storage requests and resources fluctuate over time.

Solution: If storage fills up, increase the size of the PVC. For more information about increasing the size of a PVC, see Storage class requirements.

Recovering IBM Cloud Pak for AIOps after a cluster restart

After a Red Hat® OpenShift® Container Platform cluster restart, your IBM Cloud Pak for AIOps deployment has not come up successfully.

Use the following steps to help you to identify any problems.

  1. Run the following command and then check that all of the OpenShift nodes come back up, and that they all report a STATUS of Ready.

    oc get nodes
    

    If all of the OpenShift nodes have not come back up, then see the Updating clusters and Troubleshooting sections in the OpenShift documentation for guidance.

  2. Run the following command and then check that the IBM Cloud Pak for AIOps pods all have a STATUS of Running or Completed.

    oc get pods -n <project>
    

    Where <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.

  3. Check the logs of any pods that did not report a STATUS of Running or Completed in step 2 to see whether you can identify a cause.

    oc logs -f <failed_pod_name> -n <project>
    

    Where

    • <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.
    • <failed_pod_name> is the name of a failed pod, as returned in step 2.

Solution: Restart any failed pods with the following command:

oc delete pod <failed_pod_name> -n <project>

Where

  • <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.
  • <failed_pod_name> is the name of a failed pod, as returned in step 2.
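The recovery steps above can be sketched as a single loop that deletes every pod that is not Running or Completed, so that its controller re-creates it. This is a sketch only: it assumes the default `oc get pods` output layout, and the project name is passed in as a parameter.

```shell
# Restart all failed pods in a project by deleting them so that their
# controllers re-create them. $1 is the namespace where IBM Cloud Pak
# for AIOps is deployed.
restart_failed_pods() {
  local project="$1"
  oc get pods -n "$project" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" {print $1}' \
    | while read -r pod; do
        oc delete pod "$pod" -n "$project"
      done
}

# Usage: restart_failed_pods <project>
```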

Troubleshooting installation

Installation not proceeding after creating the IBM Cloud Pak for AIOps custom resource

When you are installing, the installation does not proceed as expected after you create an instance of the IBM Cloud Pak for AIOps custom resource. For instance, after you create the instance, the installation process appears to be stalled, but no pods in the namespace show errors.

This stalling of the installation process can be due to an issue with the Operator Lifecycle Manager (OLM), which causes the expected ClusterServiceVersion (CSV) for some operator subscriptions to not be found. For example, installPlan instances for the aimanager-operator, connector-utilities-operator, and other operators can exist, but without a reference to the required CSV YAML. While the operators can be created within your defined namespace, they cannot be managed if their CSV cannot be found. If this issue occurs, the overall installation cannot proceed correctly.

Solution: Delete any operator subscription that is unresolved and has no associated ClusterServiceVersion (CSV) YAML. The installation process should re-create the associated operator and generate a new installPlan. If a new installPlan is not generated, repeat the process to delete the operator subscription and try again.
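As a sketch, subscriptions without an installed CSV can be listed with custom columns; entries that show `<none>` in the CSV column are the candidates for deletion. The two-column layout assumed by the filter below comes from the `custom-columns` output format shown in the usage comment.

```shell
# Print the names of operator subscriptions that have no installed CSV.
# Such subscriptions show '<none>' in the CSV column.
list_unresolved_subs() {
  awk 'NR > 1 && $2 == "<none>" {print $1}'
}

# Usage (replace <project> with your namespace):
# oc get subscription -n <project> \
#     -o custom-columns=NAME:.metadata.name,CSV:.status.installedCSV \
#   | list_unresolved_subs
# Then delete each listed subscription:
# oc delete subscription <subscription_name> -n <project>
```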

The AI Manager operator does not progress past Reconciling, and the Cassandra setup pod is stuck

When installing, the AI Manager operator remains in Phase: Reconciling. The aiops-ir-analytics-cassandra-setup pod is in CrashLoopBackOff state, and its log has the following error:

ERROR [2023-06-13 09:57:44,010] com.ibm.itsm.analytics.metric.service.cassandra.SessionManager: Cassandra not available.

Solution: Use the following steps to reset the Cassandra admin password, and delete the aiops-ir-analytics-cassandra-setup and aiops-ir-core-archiving-setup jobs.

  1. Run the following command to switch to your IBM Cloud Pak for AIOps project.

    oc project <project>
    

    Where <project> is the project that your IBM Cloud Pak for AIOps installation is deployed in.

  2. Reset the Cassandra admin password.

    oc rsh aiops-topology-cassandra-0 bash -c 'python3 -u /opt/ibm/change-superuser.pyc --initial --user admin --password $(cat /opt/ibm/cassandra_auth/password) --replication-factor $CASSANDRA_AUTH_REPLICATION_FACTOR'
    
  3. Delete the aiops-ir-analytics-cassandra-setup job, if it does not own any pods.

    1. Check if there are any pods belonging to the aiops-ir-analytics-cassandra-setup job.

      oc get pod -l app.kubernetes.io/managed-by=aiops-analytics-operator,app.kubernetes.io/component=cassandra-setup
      
    2. If no pods are returned, then delete the aiops-ir-analytics-cassandra-setup job.

      oc delete job -l app.kubernetes.io/managed-by=aiops-analytics-operator,app.kubernetes.io/component=cassandra-setup
      
  4. Delete the aiops-ir-core-archiving-setup job, if it does not own any pods.

    1. Check if there are any pods belonging to the aiops-ir-core-archiving-setup job.

      oc get pod -l app.kubernetes.io/managed-by=ir-core-operator,app.kubernetes.io/component=archiving-setup
      
    2. If no pods are returned, then delete the aiops-ir-core-archiving-setup job.

      oc delete job -l app.kubernetes.io/managed-by=ir-core-operator,app.kubernetes.io/component=archiving-setup
      

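The guarded deletions in steps 3 and 4 can be sketched as one helper that deletes a setup job only when no pods match its label selector:

```shell
# Delete a setup job by label selector, but only if no pods currently
# match that selector (that is, the job owns no pods).
delete_job_if_orphaned() {
  local selector="$1"
  if [ -z "$(oc get pod -l "$selector" --no-headers 2>/dev/null)" ]; then
    oc delete job -l "$selector"
  fi
}

# Usage, with the selectors from the steps above:
# delete_job_if_orphaned 'app.kubernetes.io/managed-by=aiops-analytics-operator,app.kubernetes.io/component=cassandra-setup'
# delete_job_if_orphaned 'app.kubernetes.io/managed-by=ir-core-operator,app.kubernetes.io/component=archiving-setup'
```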
"Application is not available" message is displayed intermittently

This issue is caused by a defect in some versions of Red Hat OpenShift. For more information, see OpenShift 4 - Intermittent 503 from all routes or timeouts when calling to backends directly when allow-from-openshift-ingress network policy is applied in the Red Hat OpenShift documentation.

Solution: Upgrade to a supported version of Red Hat OpenShift. For more information, see Supported Red Hat OpenShift versions.

Installation on Linux fails with network connectivity errors when using aiopsctl to install on vSphere VMs

The aiopsctl script installs the Kubernetes distribution k3s. When k3s is installed on a VMware vSphere virtual machine (VM), routing across ClusterIPs does not work. The creation of the subscriptions for Cert Manager and the Licensing Service fails with errors similar to the following because they cannot reach the catalog:

- message: 'failed to populate resolver cache from source ibm-aiops-catalog/olm:
      failed to list bundles: rpc error: code = Unavailable desc = connection error:
      desc = "transport: Error while dialing: dial tcp: lookup ibm-aiops-catalog.olm.svc
      on 10.43.0.10:53: read udp 10.42.1.2:41727->10.43.0.10:53: i/o timeout"'

This problem is caused by a known kernel issue with Red Hat® Enterprise Linux® on vSphere. To fix the problem, run the following command on each of your nodes:

ethtool -K flannel.1 tx-checksum-ip-generic off

For more information, see k3s support.

Installation on Linux: deployment is unstable after cluster restart

After a cluster restart, the zookeeper component of IBM Cloud Pak for AIOps cannot start because DNS resolution issues cause it to misidentify its assigned cluster IP address.

Running the following command shows that iaf-system-kafka-0 and some other pods failed:

kubectl get pods

Running the following command shows errors in the zookeeper logs:

kubectl logs -l app.kubernetes.io/name=zookeeper --tail=-1

Example errors:

2024-06-17 20:57:26,674 ERROR Couldn't bind to iaf-system-zookeeper-0.iaf-system-zookeeper-nodes.aiops.svc/10.42.1.53:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
java.net.BindException: Cannot assign requested address
java.net.SocketException: Unresolved address

Solution: Run the following command to restart the zookeeper pods, and then wait a few minutes for the system to recover:

kubectl delete pods -l app.kubernetes.io/name=zookeeper

Installation on Linux: unable to log in to IBM Cloud Pak for AIOps

If you have a deployment of IBM Cloud Pak for AIOps on Linux and your Domain Name System (DNS) does not provide resolution for the cp-console hostname on all of your nodes, then you may not be able to log in to the Cloud Pak for AIOps console.

Run the following command from the control plane node to view the platform-identity-provider pod logs:

oc logs deployment/platform-identity-provider -n aiops

Example output where cp-console is not resolved:

{"name":"platform-identity-provider","hostname":"platform-identity-provider-79855f988-zqp2z","pid":1,"level":50,"err":{"message":"getaddrinfo ENOTFOUND cp-console-aiops.example.com","name":"Error","stack":"Error: getaddrinfo ENOTFOUND cp-console-aiops.example.com\n    at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:107:26)","code":"ENOTFOUND"},"msg":"getaddrinfo ENOTFOUND cp-console-aiops.example.com","time":"2025-03-24T20:10:31.696Z","v":0}

Solution: Use one of the following options to resolve this issue:

Option 1: Configure DNS records to resolve the hosts for accessing the Cloud Pak for AIOps console, and ensure that the DNS is used by all of the nodes in the cluster. For more information, see DNS requirements.

Option 2: Configure CoreDNS to provide DNS resolution within the cluster. Use the following steps:

  1. Run the following command to create a ConfigMap with a custom configuration for the CoreDNS server:

    oc apply -f - <<EOF
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: coredns-custom
      namespace: kube-system
    data:
      default.server: |
        cp-console-aiops.<example.com> {
            hosts {
                  <load_balancer_ip> cp-console-aiops.<example.com>
                  fallthrough
            }
        }
    EOF
    

    Where:

    • <example.com> is your hostname
    • <load_balancer_ip> is the IP address of your load balancer
  2. Run the following command to restart CoreDNS so that it picks up the configuration:

    oc -n kube-system rollout restart deployment coredns
    
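After CoreDNS restarts, resolution can be verified from inside the cluster. The following sketch is an assumption rather than part of the documented procedure: the busybox image and the dns-check pod name are illustrative.

```shell
# Run a throwaway pod that looks up the console hostname; the pod is
# removed when the command completes. $1 is your domain (e.g. example.com).
verify_console_dns() {
  oc run dns-check --rm -i --restart=Never --image=busybox -- \
    nslookup "cp-console-aiops.$1"
}

# Usage: verify_console_dns example.com
```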

Troubleshooting upgrade

After upgrade, policies do not process alerts due to problem with the aiops-ir-lifecycle-create-policies-job

This issue can occur if there is a problem connecting to Cassandra during upgrade. The aiops-ir-lifecycle-create-policies-job reports successful completion even if it was unable to connect to Cassandra and run some of its policy-related queries. After upgrade, policies do not process alerts, and error messages similar to the following are present in the aiops-ir-lifecycle-create-policies-job's log:

"name":"installdefaultpolicies","hostname":"aiops-ir-lifecycle-create-policies-job-c8k7t","pid":42,"level":40,"err":{"message":"All host(s) tried for query failed.

"name":"installdefaultpolicies","hostname":"aiops-ir-lifecycle-create-policies-job-c8k7t","pid":42,"level":50,"msg":"Failed to update policy with content

Solution: Restart the aiops-ir-lifecycle-create-policies-job by running the following command:

oc delete job aiops-ir-lifecycle-create-policies-job

After upgrade, alerts are not visible in the UI

  1. Run the following command to view the LifecycleTrigger resource.

    oc get LifecycleTrigger aiops -o yaml
    

  2. Check the status section of the returned YAML.

    Example status section:

    status:
      conditions:
      - lastTransitionTime: "2024-04-15T10:12:44Z"
        message: Job '5e569b68014d9a669574096001ce51e6' in state 'RESTARTING'
        observedGeneration: 2
        reason: NotReady
        status: "False"
        type: LifecycleTriggerReady
      jobs:
      - jid: 5e569b68014d9a669574096001ce51e6
        startTime: "2024-04-13T20:22:17Z"
        state: RESTARTING
      savepoints:
      - path: s3://aiops-ir-lifecycle/savepoints/savepoint-d9c1a5-95d27a6574b4
        time: "2024-04-13T20:10:56Z"
      - lifecycleVersion: 4.5.1
        namespace: aiops
        path: s3://aiops-ir-lifecycle/savepoints/savepoint-d9c1a5-40e97c1c5bdd
        time: "2024-04-13T20:15:22Z"
      - lifecycleVersion: 4.5.1
        namespace: aiops
        path: s3://aiops-ir-lifecycle/savepoints/savepoint-d9c1a5-40e97c1c5bdd-upgraded
        time: "2024-04-13T20:21:53Z"
      - lifecycleVersion: 4.5.1
        namespace: aiops
        path: s3://aiops-ir-lifecycle/savepoints/savepoint-5e569b-6d60d2f35ca8
        time: "2024-04-13T20:23:20Z"
    

    Check to see if:

    • the conditions section has Job <...> in state 'RESTARTING'.
    • the savepoints section has an entry with a path that ends with -upgraded, which follows an entry with the same path without this suffix.
    • the two entries that are described above are followed by at least one more entry. This check is important.

Solution: If all of the above checks are true, then you can remediate this problem with the following steps.

  1. Note the value of jid in the status block (5e569b68014d9a669574096001ce51e6 in the example above) and export it as an environment variable.

    export JOB_ID=<jid>
    
  2. Run the following command to cancel the LifecycleTrigger job that is restarting.

    oc patch LifecycleTrigger aiops --type json --patch '[{"op":"add","path":"/spec/cancelJobs","value":["'${JOB_ID}'"]}]'
    
  3. Check the LifecycleTrigger YAML again.

    oc get LifecycleTrigger aiops -o yaml
    

    After about a minute the status.jobs section should have two entries:

    • The first entry has the old job ID, and its state changes to CANCELLED.
    • The second entry has a new job ID, and its state should eventually settle at RUNNING.
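Steps 1 and 2 can be sketched as a single helper that reads the job ID with a jsonpath query instead of copying it from the YAML. The `.status.jobs[0]` index is an assumption that holds when only the restarting job is present, as in the example above.

```shell
# Read the ID of the first job in the LifecycleTrigger status and ask
# the operator to cancel it.
cancel_restarting_lifecycle_job() {
  local job_id
  job_id=$(oc get LifecycleTrigger aiops -o jsonpath='{.status.jobs[0].jid}')
  oc patch LifecycleTrigger aiops --type json \
    --patch '[{"op":"add","path":"/spec/cancelJobs","value":["'"$job_id"'"]}]'
}

# Usage: cancel_restarting_lifecycle_job
```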

Navigation menu items disappear after an upgrade

You might notice an issue where navigation menu items in the IBM Cloud Pak for AIOps console disappear when you upgrade.

To determine if you are encountering this issue, run the following command:

kubectl logs -l component=zen-core-api

The output can resemble the following example:

time="2024-07-18 13:46:00" level=error msg=GetNavExtensions func=zen-core-api/source/apis/internal_extensions/nav.GetNavItemsWrapper file="/go/src/zen-core-api/source/apis/internal_extensions/nav/get_nav_items.go:124" exception="circular dependency or dangling node found in aiops-nav-menuitem-alerts_page,aiops-connectors-nav,aiops-nav-menuitem-metric_search,aiops-insights-nav,aiops-nav-menuitem-incidents_page,aiops-aimodel-nav,aiops-nav-menuitem-resource_management,aiops-nav-menuitem-automations"

Solution: If you see the preceding error, then use the following step to resolve the issue:

  1. Delete the aiops-baseui-frontdoor-ext resource:

    oc delete zenextension aiops-baseui-frontdoor-ext
    

    The output can resemble the following example:

    zenextension.zen.cpd.ibm.com "aiops-baseui-frontdoor-ext" deleted
    

Issues that are common to installation and upgrade

Offline install or upgrade throws 'invalid character' error

When doing an offline install or upgrade, running the oc ibm-pak generate mirror-manifests <..> command throws an error similar to the following:

Error: failed to load the catalog FBC at cp.stg.icr.io/cp/ <...> invalid character '<' in string escape code

Solution: You must have IBM Catalog Management Plug-in for IBM Cloud Pak (ibm-pak-plugin) v1.10.0 or higher installed. Complete the following steps to ensure that ibm-pak-plugin is at the required level.

  1. Check which version of ibm-pak-plugin you have installed.

    Run the following command on your bastion host, portable compute device, or connected compute device if you are using a portable storage device.

    oc ibm-pak --version
    

    Example output:

    oc ibm-pak --version
    v1.11.0
    

  2. If the ibm-pak-plugin version is lower than v1.10.0, then you must download and install the most recent version.

    Follow the steps for your installation approach.
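A version check like step 2 can be sketched with `sort -V`, which orders version strings numerically. The helper below is an assumption, not part of the documented procedure.

```shell
# Return success when the installed version ($1) is at least the
# required minimum ($2).
ibm_pak_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Usage:
# ibm_pak_at_least "$(oc ibm-pak --version)" v1.10.0 \
#   || echo "ibm-pak-plugin must be upgraded to v1.10.0 or higher"
```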

Offline install or upgrade stuck because the 'oc ibm-pak generate mirror-manifests' command fails with 'no space left on device'

The oc ibm-pak generate mirror-manifests $CASE_NAME $TARGET_REGISTRY --version $CASE_VERSION command fails with a message similar to the following in $IBMPAK_HOME/.ibm-pak/logs/oc-ibm_pak.log:

write /tmp/render-unpack-4002583241/var/lib/rpm/Packages: no space left on device

Solution: The default temporary directory does not have enough space to run the ibm-pak tool. You must set the TMPDIR environment variable to a different directory with more space before running the oc ibm-pak generate mirror-manifests command.

TMPDIR=<new_temp_dir> oc ibm-pak generate mirror-manifests ${CASE_NAME} ${TARGET_REGISTRY} --version ${CASE_VERSION}

Where <new_temp_dir> is the path for a directory with more space.
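Before re-running the command, the available space in the chosen directory can be checked. The function below is a sketch, and the 10 GB threshold in the usage comment is an assumption; the space actually needed varies with the CASE being mirrored.

```shell
# Print the available space, in kilobytes, on the filesystem that holds
# the given directory.
tmp_avail_kb() {
  df -Pk "$1" | awk 'NR == 2 {print $4}'
}

# Usage:
# if [ "$(tmp_avail_kb "$TMPDIR")" -lt 10485760 ]; then
#   echo "Less than 10 GB free in $TMPDIR; choose a larger directory" >&2
# fi
```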