Troubleshooting installation and upgrade
Learn how to isolate and resolve problems with installing and upgrading IBM Cloud Pak for AIOps.
Known issues and limitations
In addition to the common problems and solutions that are listed, review the installation-related known issues for potential issues that you might encounter during installation. For more information, see Known issues and limitations.
Troubleshooting tips
- General troubleshooting tips (Console)
- General troubleshooting tips (CLI)
- Troubleshooting storage
- Recovering IBM Cloud Pak for AIOps after a cluster restart
Note: If you have an aborted or stalled installation, uninstall any installed or partially installed components and restart the installation. For more information, see Uninstalling IBM Cloud Pak for AIOps.
Installation issues
- Installation fails to proceed when creating the IBM Cloud Pak for AIOps custom resource
- The AI manager operator does not progress past Reconciling, and the Cassandra setup pod is stuck
- "Application is not available" message is displayed intermittently
- Installation stalls with aiops-ir-analytics-cassandra-setup job in CrashLoopBackOff
- Installation on Linux only: fails with network connectivity errors when using aiopsctl to install on vSphere VMs
- Installation on Linux only: deployment is unstable after cluster restart
- Installation on Linux only: unable to log in to IBM Cloud Pak for AIOps
- Pods stuck after running aiopsctl node up, with error: "failed to pull image" (deployments on Linux only)
Upgrade issues
Issues common to installation and upgrade
General troubleshooting
General troubleshooting tips (Console)
1. From the Red Hat® OpenShift® console, go to Workloads > Pods.

2. Select the project that IBM Cloud Pak for AIOps is installed in from the drop-down menu to view the status of all the pods in this project.

3. If you want to see the status and details for one of the pods, click the pod that you want to investigate.

4. If you want to see the logs for the pod, click Logs.
General troubleshooting tips (CLI)
1. To find out the status of all the pods in a project, run the following command:

oc get pods -n <project>

Where <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.

2. To see the status of a particular pod, run the following command:

oc get pods | grep <pod_name>

Where <pod_name> is the pod that you want to see status for.

3. To see a pod's logs, run the following command:

oc logs -f <pod_name> -n <project>

Where <pod_name> is the pod that you want to see logs for, and <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.
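To narrow the output to pods that are not healthy, you can filter on the pod phase. The following command is a convenience filter, not part of the documented procedure; note that completed job pods report a phase of Succeeded, so this filter hides them along with running pods:

oc get pods -n <project> --field-selector=status.phase!=Running,status.phase!=Succeeded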
Troubleshooting storage
Storage that is associated with persistent volume claims (PVCs) reaches capacity. This issue can occur as storage requests and resources fluctuate over time.
Solution: If storage fills up, increase the size of the PVC. For more information about increasing the size of a PVC, see Storage class requirements.
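For example, if your storage class supports volume expansion (allowVolumeExpansion: true), you can grow a PVC in place by patching its storage request. The following sketch uses a hypothetical PVC name and target size; check your storage class first:

oc get sc <storage_class_name> -o jsonpath='{.allowVolumeExpansion}'

oc patch pvc <pvc_name> -n <project> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

Where <pvc_name> is the PVC that is reaching capacity, and 20Gi is an example size that must be larger than the current size.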
Recovering IBM Cloud Pak for AIOps after a cluster restart
After a Red Hat® OpenShift® Container Platform cluster restart, your IBM Cloud Pak for AIOps deployment has not come up successfully.
Use the following steps to help you to identify any problems.
1. Run the following command, and then check that all of the OpenShift nodes came back up and that they all report a STATUS of Ready:

oc get nodes

If any of the OpenShift nodes did not come back up, see the Updating clusters and Troubleshooting sections in the OpenShift documentation for guidance.

2. Run the following command, and then check that the IBM Cloud Pak for AIOps pods all have a STATUS of Running or Completed:

oc get pods -n <project>

Where <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.

3. Check the logs of any pods that did not report a STATUS of Running or Completed in step 2 to see whether you can identify a cause:

oc logs -f <failed_pod_name> -n <project>

Where <failed_pod_name> is a failed pod name, as returned in step 2, and <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.

Solution: Restart any failed pods with the following command:

oc delete pod <failed_pod_name> -n <project>

Where <failed_pod_name> is the pod name as returned in step 2, and <project> is the project (namespace) where IBM Cloud Pak for AIOps is deployed.
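If multiple pods failed, you can restart them in one pass. The following one-liner is a convenience sketch; it assumes that the stuck pods report a phase of Failed (for example, pods that were evicted during the restart). Pods in states such as CrashLoopBackOff report a Running phase and must be deleted by name as shown above:

oc get pods -n <project> --field-selector=status.phase=Failed -o name | xargs -r oc delete -n <project>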
Troubleshooting installation
Installation not proceeding after creating the IBM Cloud Pak for AIOps custom resource
When you are installing, the installation does not proceed as expected after you create an instance of the IBM Cloud Pak for AIOps custom resource. For instance, after you create the instance, the installation process appears to be stalled, but no pods in the namespace display errors.
This stalling of the installation process can be due to an issue with the Operator Lifecycle Manager (OLM), which causes the expected ClusterServiceVersion (CSV) for some operator subscriptions to not be found. For example, installPlan instances
for the aimanager-operator, connector-utilities-operator, and other operators can be found, but without a reference to the required ClusterServiceVersion (CSV) YAML. While the operators can be created within your
defined namespace, if the ClusterServiceVersion (CSV) cannot be found, the operators cannot be managed. If this issue occurs, the overall installation cannot proceed correctly.
Solution: Delete any operator subscription that does not have an associated ClusterServiceVersion (CSV) YAML and has an unresolved subscription. The installation process should re-create the associated operator and generate a new installPlan. If a new installPlan is not generated, repeat the process to delete the operator subscription and try again.
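To identify the subscriptions that are missing their CSV, you can list each subscription together with its resolved CSV; an empty CSV column indicates an unresolved subscription. The following commands are a sketch that uses standard OLM status fields; the subscription names in your cluster can differ:

oc get subscriptions.operators.coreos.com -n <project> -o custom-columns=NAME:.metadata.name,CSV:.status.currentCSV

oc delete subscriptions.operators.coreos.com <subscription_name> -n <project>

Where <subscription_name> is a subscription with an empty CSV column.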
The AI manager operator does not progress past Reconciling, and the Cassandra setup pod is stuck
When installing, the AI Manager operator remains in Phase: Reconciling. The aiops-ir-analytics-cassandra-setup pod is in CrashLoopBackOff state, and its log has the following error:
ERROR [2023-06-13 09:57:44,010] com.ibm.itsm.analytics.metric.service.cassandra.SessionManager: Cassandra not available.
Solution: Use the following steps to reset the Cassandra admin password, and delete the aiops-ir-analytics-cassandra-setup and aiops-ir-core-archiving-setup jobs.
1. Run the following command to switch to your IBM Cloud Pak for AIOps project:

oc project <project>

Where <project> is the project that your IBM Cloud Pak for AIOps installation is deployed in.

2. Reset the Cassandra admin password:

oc rsh aiops-topology-cassandra-0 bash -c 'python3 -u /opt/ibm/change-superuser.pyc --initial --user admin --password $(cat /opt/ibm/cassandra_auth/password) --replication-factor $CASSANDRA_AUTH_REPLICATION_FACTOR'

3. Delete the aiops-ir-analytics-cassandra-setup job, if it does not own any pods.

a. Check whether any pods belong to the aiops-ir-analytics-cassandra-setup job:

oc get pod -l app.kubernetes.io/managed-by=aiops-analytics-operator,app.kubernetes.io/component=cassandra-setup

b. If no pods are returned, delete the aiops-ir-analytics-cassandra-setup job:

oc delete job -l app.kubernetes.io/managed-by=aiops-analytics-operator,app.kubernetes.io/component=cassandra-setup

4. Delete the aiops-ir-core-archiving-setup job, if it does not own any pods.

a. Check whether any pods belong to the aiops-ir-core-archiving-setup job:

oc get pod -l app.kubernetes.io/managed-by=ir-core-operator,app.kubernetes.io/component=archiving-setup

b. If no pods are returned, delete the aiops-ir-core-archiving-setup job:

oc delete job -l app.kubernetes.io/managed-by=ir-core-operator,app.kubernetes.io/component=archiving-setup
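After the jobs are deleted, the operators re-create them on their next reconcile. As a verification step (assuming that the jobs come back with the same names), check that both setup jobs eventually report completions:

oc get job aiops-ir-analytics-cassandra-setup aiops-ir-core-archiving-setup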
"Application is not available" message is displayed intermittently.
This is caused by a defect in some versions of Red Hat OpenShift. For more information, see OpenShift 4 - Intermittent 503 from all routes or timeouts when calling to backends directly when allow-from-openshift-ingress network policy is applied in the Red Hat OpenShift documentation.
Solution: Upgrade to a supported version of Red Hat OpenShift. For more information, see Supported Red Hat OpenShift versions.
Installation on Linux fails with network connectivity errors when using aiopsctl to install on vSphere VMs
The aiopsctl script installs the Kubernetes distribution k3s. When k3s is installed on a VMware vSphere virtual machine (VM), routing across ClusterIPs does not work. The creation of the subscriptions for Cert Manager and the Licensing Service fails with errors similar to the following because they cannot reach the catalog:
- message: 'failed to populate resolver cache from source ibm-aiops-catalog/olm:
failed to list bundles: rpc error: code = Unavailable desc = connection error:
desc = "transport: Error while dialing: dial tcp: lookup ibm-aiops-catalog.olm.svc
on 10.43.0.10:53: read udp 10.42.1.2:41727->10.43.0.10:53: i/o timeout"'
This problem is caused by a known kernel issue with Red Hat® Enterprise Linux® on vSphere. To fix the problem, run the following command on each of your nodes:
ethtool -K flannel.1 tx-checksum-ip-generic off
For more information, see k3s support.
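Note that this ethtool setting does not persist across reboots. One possible way to reapply it automatically is a small systemd oneshot unit, as in the following sketch; the unit name and file path are illustrative and are not created by aiopsctl:

cat <<'EOF' | sudo tee /etc/systemd/system/flannel-tx-checksum-off.service
[Unit]
Description=Disable TX checksum offload on flannel.1 (RHEL on vSphere workaround)
# flannel.1 is created by k3s; on worker nodes use k3s-agent.service instead.
After=k3s.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now flannel-tx-checksum-off.service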
Installation on Linux: deployment is unstable after cluster restart
After a cluster restart, one of the IBM Cloud Pak for AIOps components, zookeeper, cannot start because of DNS resolution issues that cause it to misidentify its assigned cluster IP address.
Running the following command shows that iaf-system-kafka-0 and some other pods failed:
kubectl get pods
Running the following command shows errors in the zookeeper logs:
kubectl logs -l app.kubernetes.io/name=zookeeper --tail -1
Example errors:
2024-06-17 20:57:26,674 ERROR Couldn't bind to iaf-system-zookeeper-0.iaf-system-zookeeper-nodes.aiops.svc/10.42.1.53:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
java.net.BindException: Cannot assign requested address
java.net.SocketException: Unresolved address
Solution: Run the following command to restart the zookeeper pods, and then wait a few minutes for the system to recover:
kubectl delete pods -l app.kubernetes.io/name=zookeeper
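To confirm the recovery, you can watch the zookeeper pods return to Running, and then check that iaf-system-kafka-0 and the other failed pods come back as well:

kubectl get pods -l app.kubernetes.io/name=zookeeper -w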
Installation on Linux: unable to log in to IBM Cloud Pak for AIOps
If you have a deployment of IBM Cloud Pak for AIOps on Linux and your Domain Name System (DNS) does not provide resolution for the cp-console hostname on all of your nodes, then you might not be able to log in to the Cloud Pak for AIOps console.
Run the following command from the control plane node to view the platform-identity-provider pod logs:
oc logs deployment/platform-identity-provider -n aiops
Example output where cp-console is not resolved:
{"name":"platform-identity-provider","hostname":"platform-identity-provider-79855f988-zqp2z","pid":1,"level":50,"err":{"message":"getaddrinfo ENOTFOUND cp-console-aiops.example.com","name":"Error","stack":"Error: getaddrinfo ENOTFOUND cp-console-aiops.example.com\n at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:107:26)","code":"ENOTFOUND"},"msg":"getaddrinfo ENOTFOUND cp-console-aiops.example.com","time":"2025-03-24T20:10:31.696Z","v":0}
Solution: Use one of the following options to resolve this issue:
Option 1: Configure DNS records to resolve the hosts for accessing the Cloud Pak for AIOps console, and ensure that the DNS is used by all of the nodes in the cluster. For more information, see DNS requirements.
Option 2: Configure CoreDNS to provide DNS resolution within the cluster. Use the following steps:
1. Run the following command to create a ConfigMap with a custom configuration for the CoreDNS server:

oc apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  default.server: |
    cp-console-aiops.<example.com> {
      hosts {
        <load_balancer_ip> cp-console-aiops.<example.com>
        fallthrough
      }
    }
EOF

Where:
<example.com> is your hostname.
<load_balancer_ip> is the IP address of your load balancer.

2. Run the following command to apply the configuration:

oc -n kube-system rollout restart deployment coredns
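To verify in-cluster name resolution after the restart, you can run a throwaway pod that looks up the cp-console hostname. This check assumes that your cluster can pull the public busybox image; substitute any image that includes nslookup:

oc run dns-check --rm -it --restart=Never --image=busybox:1.36 -- nslookup cp-console-aiops.<example.com>

The lookup should return the <load_balancer_ip> address that you configured in the ConfigMap.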
Pods stuck after running aiopsctl node up, with error: "failed to pull image" (deployments on Linux only)
If you are using a proxy and pods are stuck in ContainerCreating or ImagePullBackOff after running the aiopsctl node up command, you might have missing or incorrect proxy settings on one or more nodes that prevent successful image pulls.
1. Run the following command to view errors for the pod that is stuck:

oc describe pod <pod-name> -n <namespace>

Example output where image pulls fail because of missing or incorrect proxy configuration:

FailedCreatePodSandBox 1m kubelet Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = failed to pull image ... dial tcp <registry-ip>:443: i/o timeout

2. Verify whether your system environment configuration settings are set up correctly for your proxy.

Run the following command on each of your control plane nodes:

cat /etc/systemd/system/k3s.service.env

Run the following command on each of your worker nodes:

cat /etc/systemd/system/k3s-agent.service.env

Example output:

HTTP_PROXY=<Your proxy>
HTTPS_PROXY=<Your proxy>
NO_PROXY=127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 # sample list of IPs

Check that the NO_PROXY variable includes the IP address ranges for the public and private IPs of the cluster nodes.

3. If the proxy variables are missing or incorrect, correct them and apply your changes with the following steps.

a. Add or update the variables in the appropriate .env file (/etc/systemd/system/k3s.service.env or /etc/systemd/system/k3s-agent.service.env).

b. Restart the k3s service.

Run the following command on each of the control plane nodes:

sudo systemctl restart k3s

Run the following command on each of the worker nodes:

sudo systemctl restart k3s-agent

c. Validate connectivity and image pulls.

Run the following command to manually pull an image and confirm network access:

crictl pull $(cat /etc/rancher/k3s/config.yaml | grep "pause-image:" | sed 's/pause-image: //')

If the image pull succeeds, then the proxy configuration is now correct.
Installation stalls with aiops-ir-analytics-cassandra-setup job in CrashLoopBackOff
Installation stalls because the aiops-ir-analytics-cassandra-setup pod is in CrashLoopBackOff state, and its log has an error similar to the following:
ERROR [2025-05-05 08:54:52,906] com.ibm.itsm.topology.service.app.BaseServiceApp: java.lang.IllegalStateException: Schema is not in agreement, wait and retry
Solution: Restart the Cassandra StatefulSets.
1. Run the following commands to export an environment variable for your IBM Cloud Pak for AIOps project and switch to it:

export AIOPS_NAMESPACE=<project>
oc project <project>

Where <project> is the project that your IBM Cloud Pak for AIOps installation is deployed in.

2. Make a note of the number of replicas for Cassandra:

oc get statefulset aiops-topology-cassandra -n ${AIOPS_NAMESPACE}

Example output for 3 Cassandra StatefulSet replicas:

NAME                       READY   AGE
aiops-topology-cassandra   0/3     43m

3. Scale down the Cassandra StatefulSet replicas to 0:

oc scale statefulsets aiops-topology-cassandra --replicas=0 -n ${AIOPS_NAMESPACE}

4. Manually scale up the first Cassandra StatefulSet replica:

oc scale statefulsets aiops-topology-cassandra --replicas=1 -n ${AIOPS_NAMESPACE}

Wait until the node is ready.

5. Rerun step 4 with an incremented number for replicas, and wait until the node is ready. For example:

oc scale statefulsets aiops-topology-cassandra --replicas=2 -n ${AIOPS_NAMESPACE}

Repeat this step until you have the number of Cassandra StatefulSet replicas that you noted in step 2.
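If you have several replicas, steps 4 and 5 can be scripted. The following loop is a minimal sketch that assumes the 3 replicas from the example in step 2; oc rollout status blocks until the currently requested replicas are ready:

for i in 1 2 3; do
  oc scale statefulsets aiops-topology-cassandra --replicas=$i -n ${AIOPS_NAMESPACE}
  # Wait for replica $i to become ready before adding the next one.
  oc rollout status statefulset aiops-topology-cassandra -n ${AIOPS_NAMESPACE} --timeout=15m
done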
Troubleshooting upgrade
After upgrade, policies do not process alerts due to problem with the aiops-ir-lifecycle-create-policies-job
This issue can occur if there is a problem connecting to Cassandra during upgrade. The aiops-ir-lifecycle-create-policies-job reports successful completion even if it was unable to connect to Cassandra and run some of its policy-related queries. After upgrade, policies do not process alerts, and error messages similar to the following are present in the aiops-ir-lifecycle-create-policies-job's log:
"name":"installdefaultpolicies","hostname":"aiops-ir-lifecycle-create-policies-job-c8k7t","pid":42,"level":40,"err":{"message":"All host(s) tried for query failed.
"name":"installdefaultpolicies","hostname":"aiops-ir-lifecycle-create-policies-job-c8k7t","pid":42,"level":50,"msg":"Failed to update policy with content
Solution: Restart the aiops-ir-lifecycle-create-policies-job by running the following command:
oc delete job aiops-ir-lifecycle-create-policies-job
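The lifecycle operator re-creates the job after you delete it. As a verification step (assuming that the job comes back with the same name), check that the new run completes without the Cassandra connection errors:

oc get job aiops-ir-lifecycle-create-policies-job
oc logs job/aiops-ir-lifecycle-create-policies-job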
After upgrade, alerts are not visible in the UI
1. Run the following command to view the LifecycleTrigger job:

oc get LifecycleTrigger aiops -o yaml

2. Check the status section of the returned YAML.

Example status section:

status:
  conditions:
  - lastTransitionTime: "2024-04-15T10:12:44Z"
    message: Job '5e569b68014d9a669574096001ce51e6' in state 'RESTARTING'
    observedGeneration: 2
    reason: NotReady
    status: "False"
    type: LifecycleTriggerReady
  jobs:
  - jid: 5e569b68014d9a669574096001ce51e6
    startTime: "2024-04-13T20:22:17Z"
    state: RESTARTING
    savepoints:
    - path: s3://aiops-ir-lifecycle/savepoints/savepoint-d9c1a5-95d27a6574b4
      time: "2024-04-13T20:10:56Z"
    - lifecycleVersion: 4.5.1
      namespace: aiops
      path: s3://aiops-ir-lifecycle/savepoints/savepoint-d9c1a5-40e97c1c5bdd
      time: "2024-04-13T20:15:22Z"
    - lifecycleVersion: 4.5.1
      namespace: aiops
      path: s3://aiops-ir-lifecycle/savepoints/savepoint-d9c1a5-40e97c1c5bdd-upgraded
      time: "2024-04-13T20:21:53Z"
    - lifecycleVersion: 4.5.1
      namespace: aiops
      path: s3://aiops-ir-lifecycle/savepoints/savepoint-5e569b-6d60d2f35ca8
      time: "2024-04-13T20:23:20Z"
Check to see if:

- The conditions section has Job <...> in state 'RESTARTING'.
- The savepoints section has an entry with a path that ends with -upgraded, which follows an entry with the same path without this suffix.
- The two entries that are described above are followed by at least one more entry. This check is important.
Solution: If all of the above checks are true, then you can remediate this problem with the following steps.
1. Note the value of jid in the status block (5e569b68014d9a669574096001ce51e6 in the example above) and export it as an environment variable:

export JOB_ID=<jid>

2. Run the following command to cancel the LifecycleTrigger job that is restarting:

oc patch LifecycleTrigger aiops --type json --patch '[{"op":"add","path":"/spec/cancelJobs","value":["'${JOB_ID}'"]}]'

3. Check the LifecycleTrigger YAML again:

oc get LifecycleTrigger aiops -o yaml

After about a minute, the status.jobs section should have two entries:

- The first entry has the old job ID, and its state changes to CANCELLED.
- The second entry has a new job ID, and its state should eventually settle at RUNNING.
Issues that are common to installation and upgrade
Offline install or upgrade throws 'invalid character' error
When doing an offline install or upgrade, running the oc ibm-pak generate mirror-manifests <..> command throws an error similar to the following:
Error: failed to load the catalog FBC at cp.stg.icr.io/cp/ <...> invalid character '<' in string escape code
Solution: You must have IBM Catalog Management Plug-in for IBM Cloud Pak (ibm-pak-plugin) v1.10 or higher installed. Run the following commands to ensure that ibm-pak-plugin is at the
required level.
1. Check which version of ibm-pak-plugin you have installed.

Run the following command on your bastion host, portable compute device, or connected compute device if you are using a portable storage device:

oc ibm-pak --version

Example output:

v1.11.0

2. If the ibm-pak-plugin version is lower than v1.10.0, then you must download and install the most recent version. Follow the steps for your installation approach:

- Bastion host: Install the IBM Catalog Management Plug-in for IBM Cloud Pak®.
- Portable device: Install the IBM Catalog Management Plug-in for IBM Cloud Pak®.
Offline install or upgrade stuck because the 'oc ibm-pak generate mirror-manifests' command fails with 'no space left on device'
The oc ibm-pak generate mirror-manifests $CASE_NAME $TARGET_REGISTRY --version $CASE_VERSION command fails with a message similar to the following in $IBMPAK_HOME/.ibm-pak/logs/oc-ibm_pak.log:
write /tmp/render-unpack-4002583241/var/lib/rpm/Packages: no space left on device
Solution: The default temporary directory does not have enough space to run the ibm-pak tool. You must set the TMPDIR environment variable to a different directory with more space before running the
oc ibm-pak generate mirror-manifests command.
TMPDIR=<new_temp_dir> oc ibm-pak generate mirror-manifests ${CASE_NAME} ${TARGET_REGISTRY} --version ${CASE_VERSION}
Where <new_temp_dir> is the path for a directory with more space.
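For example, the following sequence creates a temporary directory on a larger filesystem (the path is yours to choose) and confirms that it has more free space than the default before rerunning the command:

mkdir -p <new_temp_dir>
df -h /tmp <new_temp_dir>
TMPDIR=<new_temp_dir> oc ibm-pak generate mirror-manifests ${CASE_NAME} ${TARGET_REGISTRY} --version ${CASE_VERSION}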