Troubleshooting

Before you open a support case for Cloud Pak for Business Automation, collect specific information about your deployments so that you can assess their health and diagnose the problem.
Tip: The Troubleshooting capabilities links at the bottom go directly to the troubleshooting sections of the different capabilities.

Collect specific data about your environment and your Cloud Pak installation before you contact IBM support for assistance with a Cloud Pak for Business Automation issue. Always provide a detailed description of the problem and your environment.

When you run diagnostic commands, run them from an empty directory to package the files more cleanly. Run the commands from the namespace in which you observe the problematic container or component. For more information, see Mustgather: Collecting data to diagnose issues.

The OpenShift must-gather CLI command (oc adm must-gather) collects information from your cluster that can be used to debug issues. You can specify one or more images by including the --image argument when you run the command. When you specify an image, the tool collects data that is related to that image.
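
As a minimal sketch of the paragraph above, the following shell function wraps oc adm must-gather and passes one --image flag for each image that you name. The function name run_must_gather is hypothetical, and the images to pass depend on your deployment; none are assumed here.

```shell
# Run "oc adm must-gather" once, with one --image flag per image argument.
# Usage: run_must_gather <dest_dir> <image> [<image> ...]
run_must_gather() {
  if [ "$#" -lt 2 ]; then
    echo "usage: run_must_gather <dest_dir> <image> [<image> ...]" >&2
    return 2
  fi
  dest_dir=$1
  shift
  # Rewrite the positional parameters from image names to --image=<name> flags.
  for image in "$@"; do
    set -- "$@" "--image=$image"
    shift
  done
  oc adm must-gather --dest-dir="$dest_dir" "$@"
}
```

Calling the function from an empty directory with a destination such as ./must-gather keeps the collected files separate from anything else on disk.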

The ibm-cp4a-operator locates the Cloud Pak base images and has Ansible® roles to handle the reconciliation logic and declare a set of playbook tasks for each component. The roles declare all the variables and defaults for how the role is executed.

The operator deployment creates a container on your cluster for the operator. The following diagram shows how the operator watches for events, triggers an Ansible role when a custom resource changes, and then reconciles the resources for the deployed applications.

Operator workflow

Use the following sections to find the information that you are looking for.

Collecting data

Depending on the type of operator, different logs are more useful. Use the following table to choose the Ansible or Go logs.

Table 1. Operator types
Capability                                      | Type of operator | Operator name
CP4BA (multi-pattern)                           | Ansible          | ibm-cp4a-operator
CP4BA FileNet Content Manager                   | Ansible          | ibm-content-operator
CP4BA Operational Decision Manager              | Ansible          | ibm-odm-operator
CP4BA Automation Document Processing            | Ansible          | ibm-dpe-operator
CP4BA Automation Decision Service               | Go               | ibm-ads-operator
CP4BA Workflow Process Server                   | Go               | ibm-cp4a-wfps-operator
CP4BA Process Federation Server                 | Go               | ibm-cp4a-pfs-operator
CP4BA Workflow Runtime and Workstreams Services | Go               | ibm-workflow-operator
CP4BA Business Automation Insights              | Ansible          | ibm-insights-engine-operator
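
If you script the log collection, the lookup in Table 1 can be expressed as a small helper. The following sketch only encodes the rows of the table; the function name operator_type is hypothetical.

```shell
# Print "ansible" or "go" for an operator name that is listed in Table 1.
operator_type() {
  case "$1" in
    ibm-cp4a-operator|ibm-content-operator|ibm-odm-operator|ibm-dpe-operator|ibm-insights-engine-operator)
      echo ansible ;;
    ibm-ads-operator|ibm-cp4a-wfps-operator|ibm-cp4a-pfs-operator|ibm-workflow-operator)
      echo go ;;
    *)
      echo "unknown operator: $1" >&2
      return 1 ;;
  esac
}
```

A script can then branch on the result to decide whether to read the Ansible runner output or the container log of the operator.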

The following describes how to get additional logs, information about pods, secrets, and events that might help with troubleshooting.

Getting the logs of the Ansible-based operators
To get the log of the latest reconciliation for Ansible-based operators, run the following command:
# <Must set> Set your project (namespace) name here
export project_name=<your_project_name>

# <Must set> Set the target operator name here (see Table 1)
export operator_name=<operator_name>

# Find the operator pod in the project, then save the output of its latest Ansible run
operator_pod_name=$(kubectl get pod -n "$project_name" | grep "$operator_name" | awk '{print $1}')
kubectl exec -i "$operator_pod_name" -n "$project_name" -- /bin/bash -c 'cat /tmp/ansible-operator/runner/icp4a.ibm.com/v1/*/*/*/artifacts/latest/stdout' > operator-ansible.log

Optional: Export the history of the Ansible logs.

Ansible operators keep a backup of the logs under /logs/$operator_pod_name/ansible-operator/runner/<group>/<version>/<kind>/<namespace>/<name>/artifacts. The log contains information on the first 10 reconciles, including the latest reconcile. The following commands copy the logs to a local directory. Select the operator name for which you want to export the log.

# <Must set> Set your project (namespace) name here
export project_name=<your_project_name>

export deployment_name=$(kubectl get icp4acluster -n "$project_name" --no-headers | awk '{print $1}')

# Export the CP4BA operator's Ansible log to /tmp/$operator_pod_name-log. Skip this step if you installed from the Content operator.
export operator_pod_name=$(kubectl get pod -n "$project_name" | grep ibm-cp4a-operator | awk '{print $1}')
kubectl cp "$project_name/$operator_pod_name:/logs/$operator_pod_name/ansible-operator/runner/icp4a.ibm.com/v1/ICP4ACluster/$project_name/$deployment_name/artifacts" "/tmp/$operator_pod_name-log"

# Export the Content operator's Ansible log to /tmp/$operator_pod_name-log. Run this step only when the Content pattern is involved.
export operator_pod_name=$(kubectl get pod -n "$project_name" | grep ibm-content-operator | awk '{print $1}')
kubectl cp "$project_name/$operator_pod_name:/logs/$operator_pod_name/ansible-operator/runner/icp4a.ibm.com/v1/Content/$project_name/$deployment_name/artifacts" "/tmp/$operator_pod_name-log"

# Export the Foundation operator's Ansible log to /tmp/$operator_pod_name-log. Skip this step if you installed from the CP4BA operator.
export operator_pod_name=$(kubectl get pod -n "$project_name" | grep icp4a-foundation-operator | awk '{print $1}')
kubectl cp "$project_name/$operator_pod_name:/logs/$operator_pod_name/ansible-operator/runner/icp4a.ibm.com/v1/Foundation/$project_name/$deployment_name/artifacts" "/tmp/$operator_pod_name-log"
Note: If you see "Cannot stat: No such file or directory" when you export the Ansible logs, either the current operator has not generated a log yet or the operator is still in its first reconcile.
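
The backup path follows the same pattern for every Ansible-based operator, so it can be built by a helper instead of being typed out for each kubectl cp command. The following sketch assumes the icp4a.ibm.com group and the v1 version that the commands in this section use; the function name ansible_artifacts_path is hypothetical.

```shell
# Print the backup directory of the Ansible runner artifacts inside an operator pod.
# Usage: ansible_artifacts_path <operator_pod> <kind> <namespace> <cr_name>
ansible_artifacts_path() {
  printf '/logs/%s/ansible-operator/runner/icp4a.ibm.com/v1/%s/%s/%s/artifacts\n' \
    "$1" "$2" "$3" "$4"
}
```

The printed path can be used as the source of a kubectl cp command after the <namespace>/<pod>: prefix.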

Optional: Edit the verbosity of the Ansible logs.

If the operator log does not provide the level of detail that you need, you can gather more details by adding an annotation like the following example to your custom resource YAML:

metadata:
  ...
  annotations:
    ansible.sdk.operatorframework.io/verbosity: "3"
spec:

For the verbosity value, the normal rules for Ansible verbosity apply, where higher values mean more output. Acceptable values range from 0 (only the most severe messages are output) to 7 (all debugging messages are output). After you update the custom resource YAML, reapply the YAML for the changes to take effect.
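
As an alternative to editing the YAML file by hand, the same annotation can be set with kubectl annotate. The following function is a sketch under that assumption: it rejects values outside the 0 to 7 range before it calls kubectl. The function name set_ansible_verbosity is hypothetical, and the kind, resource name, and namespace are arguments that you supply.

```shell
# Set the Ansible verbosity annotation on a custom resource.
# Usage: set_ansible_verbosity <kind> <cr_name> <namespace> <level 0-7>
set_ansible_verbosity() {
  level=$4
  case "$level" in
    [0-7]) ;;  # a single digit from 0 to 7 is accepted
    *)
      echo "verbosity must be an integer from 0 to 7" >&2
      return 2 ;;
  esac
  kubectl annotate "$1" "$2" -n "$3" --overwrite \
    "ansible.sdk.operatorframework.io/verbosity=$level"
}
```

The --overwrite flag lets you change the level again later, for example back to a lower value after you collect the logs.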

Getting the logs of the Go-based operators
To get the log for Go-based operators, run the following command:
kubectl logs deployment/$operator_name -n $project_name > operator.log
Getting information about pending pods
If some pods are pending, choose one of the pods, and run the following command to get more information.
kubectl describe pod <podname> 
Getting information about secrets
Kubernetes secrets are used extensively, so output about them might also be useful.
kubectl get secrets
Getting information about events
Kubernetes events are objects that provide more insight into what is happening inside a cluster, such as what decisions the scheduler makes or why some pods are evicted from a node. To get information about these events, run the following command.
kubectl get events > events.log

You can also add the verbosity flag (-v) to any kubectl command to get more detailed output.

kubectl -v=9 get pods
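
The commands in this section can be combined into one helper that writes its output to a dedicated directory. The following is a sketch only: the function name collect_basic_diagnostics and the output file names are hypothetical. Note that kubectl get secrets lists the secrets without printing their values.

```shell
# Collect pod, secret, and event information for one namespace into a directory.
# Usage: collect_basic_diagnostics <namespace> <output_dir>
collect_basic_diagnostics() {
  ns=$1
  out_dir=$2
  mkdir -p "$out_dir" || return 1
  kubectl get pods -n "$ns" -o wide > "$out_dir/pods.log"
  kubectl get secrets -n "$ns" > "$out_dir/secrets.log"
  kubectl get events -n "$ns" > "$out_dir/events.log"
  # Describe every pod that is still in the Pending phase.
  for pod in $(kubectl get pods -n "$ns" --field-selector=status.phase=Pending -o name); do
    kubectl describe "$pod" -n "$ns"
  done > "$out_dir/pending-pods.log"
}
```

If no pod is pending, pending-pods.log is created but stays empty.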
Enabling Liberty tracing for Liberty-based CP4BA pods

For FileNet Content Manager (FNCM), Business Automation Navigator (BAN), and Automation Document Processing (ADP) pods, use the following steps to enable a WebSphere® Application Server Liberty logging trace specification:

  1. Create a custom_server.xml file with a custom Liberty trace specification. A Liberty trace specification can vary and depends on why you are enabling it. The specification might come from IBM support or Liberty support.
  2. Copy the custom_server.xml file into the target pod under the /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides directory. This directory is mapped to a PVC where the configuration file can be persisted.

    WebSphere Application Server Liberty immediately detects the server configuration file and creates a trace.log file in the default directory, or in a custom directory if you specified one in the custom_server.xml file.
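
As an illustration of step 1, the following function writes a custom_server.xml file that contains a Liberty logging element. It is a sketch, not a recommended configuration: the function name write_liberty_trace_config is hypothetical, the file size and file count values are arbitrary, and the trace specification must be the one that you need or that support gives you.

```shell
# Write a Liberty configuration drop-in that sets a trace specification.
# Usage: write_liberty_trace_config <trace_specification> <output_file>
write_liberty_trace_config() {
  cat > "$2" <<EOF
<server>
  <logging traceSpecification="$1"
           traceFileName="trace.log"
           maxFileSize="20"
           maxFiles="10"/>
</server>
EOF
}
```

For step 2, the generated file can be copied into the pod with a command of the form kubectl cp custom_server.xml <namespace>/<pod>:/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/custom_server.xml.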

Resolving issues

EDB Postgres instance is in a fenced state

If the EDB Postgres instance (postgres-cp4ba) is in the Fenced status, you see the following message in the logs:

"msg":"Instance is fenced, won't start postgres right now","logging_pod":"postgres-cp4ba-1"

Run the following command to view the logs:

oc logs -n <CP4BA_namespace> -l k8s.enterprisedb.io/cluster=postgres-cp4ba

You can resolve the issue by running the following command. The trailing minus sign (-) is the kubectl syntax for removing an annotation, in this case "k8s.enterprisedb.io/fencedInstances".

kubectl annotate cluster.postgresql.k8s.enterprisedb.io postgres-cp4ba -n <CP4BA_namespace> k8s.enterprisedb.io/fencedInstances-

For more information, see Fencing on the EDB docs.
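
To check for the fenced message from a script, you can search the logs for it. The following sketch assumes the cluster name postgres-cp4ba that is used in this section; the function name postgres_is_fenced is hypothetical. The --tail=-1 option is included because oc logs returns only the last 10 lines of each pod by default when a label selector is used.

```shell
# Succeed if the logs of the postgres-cp4ba pods report a fenced instance.
# Usage: postgres_is_fenced <CP4BA_namespace>
postgres_is_fenced() {
  oc logs -n "$1" -l k8s.enterprisedb.io/cluster=postgres-cp4ba --tail=-1 \
    | grep -q "Instance is fenced"
}
```

The function only reports the state. Remove the annotation yourself after you confirm that the instance was not fenced on purpose.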

Access routes return a 404 error
If the URLs for the installed Cloud Pak for Business Automation components in the cp4ba-access-info ConfigMap return a 404 error even though the operator logs show no errors, a Zen extension might not have started properly. You can resolve the issue by deleting the uncompleted Zen extensions and letting the operators re-create them. To get the list of installed Zen extensions, run the following command.
oc get zenextension 

The following command provides an example of how to delete a Zen extension.

oc delete zenextension icp4adeploy-<component>-zen-extension 
Cannot connect to the web client when accessing Navigator
If you see an error message that states that a client cannot connect to the web client, refresh your browser. The message then goes away.

The message appears during the relatively short window of time in which the back-end Navigator pod is rescheduled, for example when you update the Navigator admin desktop properties or create a new role or policy. These actions sometimes cause connection errors, but more often the result is a message that states that the server is unavailable.

Re-creating the image pull secret
If your image pull secret (Docker registry secret) expires, you can delete the secret and re-create it in the same namespace:
oc delete secret ibm-entitlement-key -n <namespace>
oc create secret docker-registry ibm-entitlement-key -n <namespace> --docker-server=image-registry.openshift-image-registry.svc:5000 --docker-username=kubeadmin --docker-password=$(oc whoami -t)
Applying changes by restarting pods
Sometimes, changes that you make in the custom resource YAML by using the operator or directly in the environment are not automatically propagated to all pods. For example, modifications to data source information or changes to Kubernetes secrets are not seen by running pods until the pods are restarted.

If changes applied by the operator or other modifications that are made in the environment do not provide the expected result, restart the pods by scaling the impacted deployments down to 0 then up to the number that you want to have. Kubernetes (OpenShift) terminates the existing pods and creates new ones.
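
The scale-down and scale-up sequence can be sketched as follows. The function reads the current replica count first so that it can restore the same number. The function name restart_by_scaling is hypothetical, and the sketch assumes that nothing else changes the replica count between the two scale commands.

```shell
# Restart the pods of a deployment by scaling it down to 0 and back up.
# Usage: restart_by_scaling <deployment> <namespace>
restart_by_scaling() {
  deploy=$1
  ns=$2
  # Remember the current replica count so that it can be restored.
  replicas=$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath='{.spec.replicas}') || return 1
  kubectl scale deployment "$deploy" -n "$ns" --replicas=0 || return 1
  kubectl scale deployment "$deploy" -n "$ns" --replicas="$replicas"
}
```

If you want a different number of pods after the restart, run the second kubectl scale command yourself with that number.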

CrashLoopBackOff status when an ODF storage class is used
If you install a CP4BA instance that uses an ODF storage class, you might see some pods that fail to be ready after the OCP cluster is rebooted.

To resolve the issue, manually restart the pods that fail to be ready.

Directory mount failure prevents pod readiness
If a pod stays in a CreateContainerError state and the description of the problem includes text that is similar to the following message, remove the failing mounted path.
Warning  Failed  43m  kubelet  Error: container create failed: time="2021-03-03T07:26:47Z" level=warning msg="unable to terminate initProcess" error="exit status 1"
time="2021-03-03T07:26:47Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: rootfs_linux.go:60: mounting \"/var/lib/kubelet/pods/473b091d-acff-437b-b568-2383604dac01/volume-subpaths/config-volume/icp4adeploy-cmis-deploy/3\" to rootfs at \"/var/lib/containers/storage/overlay/d011608f6df4bbfcc26c7d60568915caf7932124e61924b1a75802e6884ea060/merged/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml\" caused: not a directory"

The problem occurs when a folder is generated instead of an XML file. An empty folder is created at the path where the file is supposed to be mounted, and mounting the file onto that folder raises the error.

You can remove a problematic folder from a deployment in two ways:

  • If you can access the persistent volume, go to the mounted path and delete it. You can get the path to the folder by running the following command.
    oc describe pv $pv_name
  • If you cannot access the persistent volume, edit the deployment by removing the failed mount.
    1. Edit the deployment by running the oc edit deployment <deployment_name> command, and remove the failing mount. The following lines show an example mountPath:
      - mountPath: /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
        name: config-volume
        subPath: ibm_oidc_rp.xml
    2. When the pod is Running, access it by using the oc exec -it command.
      oc exec -it icp4adeploy-cmis-deploy-5cd4774f78-mg6pw -- bash
    3. Delete the folder with the rm -r command.
      rm -r /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml

When the folder is removed, you can wait for the operator to reconcile the change or add the removed mount path back manually to fix it.

Cannot log in to the Zen console
After installation, you might not be able to log in to the Zen console by using the default cluster administrator cpadmin username. The cause of this problem is that the name cpadmin might also exist in the LDAP directory.

To resolve the login issue, use the following steps.

  1. Change the name of the cpadmin user in the platform-auth-idp-credentials secret.
  2. Change the cluster-wide role binding oidc-admin-binding to the new username.
  3. Log in to the Zen console UI by using the OpenShift Credentials.
  4. Add new admin users in the console.

For more information, see Changing the Cloud Pak administrator username.

Issues trying to install after you uninstalled
If you see issues when you install a new instance on a cluster that you already used for a Cloud Pak deployment, check if the foundational services dependencies are properly deleted.

For more information, see Uninstallation does not remove all components.

Profile size does not scale down

When you decrease the pattern profile size after installation, from large to medium or from medium to small, Cloud Pak foundational services do not scale down with the profile size change. This behavior is expected. For more information about profile sizes, see System requirements.

Operator pod in OOMKilled status
If the Cloud Pak for Business Automation operator pod, or any other operator pod, has the status OOMKilled, the resources that are allocated to the operator pod are not enough for the workload. You can modify the ClusterServiceVersion (CSV) to give the operator more resources. To find the name of the CSV that needs to change, run oc get csv -n $operator_namespace. You can adjust the following example to get the operator pod up and running again.
oc patch csv ibm-cp4a-operator.v24.0.0 --type=json -p '[
{
"op":"replace",
"path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/cpu",
"value": "4"
},
{
"op":"replace",
"path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory",
"value": "8Gi"
},
{
"op":"replace",
"path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/cpu",
"value": "1500m"
},
{
"op":"replace",
"path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/memory",
"value": "1600Mi"
}
]'
Business Performance Center dashboards are missing
If Business Performance Center dashboards do not appear when you log in to the Business Performance Center web page:
  1. Get the cockpit pod by running the following command:
    kubectl get pod |grep insights-engine-cockpit
  2. Delete the cockpit pod.
  3. Delete the CP4BA operator pod.
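
Steps 2 and 3 can be sketched with one helper that deletes the pods whose names contain a given string. The function name delete_pods_matching is hypothetical. The usage comments assume that the CP4BA operator pod name contains ibm-cp4a-operator, as in Table 1; check the pod names in your namespace first.

```shell
# Delete every pod in a namespace whose name contains the given string.
# Usage: delete_pods_matching <namespace> <string>
#   delete_pods_matching <namespace> insights-engine-cockpit
#   delete_pods_matching <namespace> ibm-cp4a-operator
delete_pods_matching() {
  for pod in $(kubectl get pod -n "$1" -o name | grep -- "$2"); do
    kubectl delete "$pod" -n "$1"
  done
}
```

Kubernetes re-creates the deleted pods because they belong to deployments.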
Business Teams Service (BTS) cannot be installed

If BTS fails to install and you see the following messages in the logs, then it means that the ibm-bts-oidc-client secret cannot be created.

oc logs ibm-bts-operator-controller-manager-instance-name|tail -10
IAM Client secret not yet found, retry after 5 seconds...

To resolve the issue, create a shell script with an oc exec -it command, and then run the script in the namespace where your foundational services are installed. For more information, see Steps to follow if certificate import is an issue.

Operator pods get evicted due to the lack of ephemeral storage
Some of the operator pods might get evicted when the predefined ephemeral storage is full. Possible causes of this problem include, but are not limited to:
  • Debug logging is turned on.
  • The retry logic of the operator is waiting for some resources to become available.
Typically, the pod might show the following error when you get the description of the pod:
Status:       Failed
Reason:       Evicted
Message:      Pod ephemeral local storage usage exceeds the total limit of containers 500Mi. 
Follow these steps to resolve the problem:
  1. Determine the CSV that the operator belongs to. Typically, you can predict the CSV name by looking at the operator's pod name. For example, if the operator's pod name is ibm-dpe-operator-59797bd587-clqv7, then you can find its CSV with the following command:
    CSV=$(oc get csv |grep ibm-dpe-operator | awk '{print $1}') 
    echo $CSV
  2. Update the ephemeral storage to a specific value. In the following example, the new ephemeral storage size is 800Mi:
    oc patch csv "$CSV" --type=json -p '[{"op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/ephemeral-storage", "value": "800Mi"}]'

After you update the value, the operator pod restarts.
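
One way to avoid shell quoting problems in the patch is to generate the JSON with a small function and pass the result to oc patch. The following is a sketch; the function name ephemeral_storage_patch is hypothetical, and the path is the one that is used in step 2.

```shell
# Print a JSON patch that replaces the ephemeral-storage limit of the first
# container in the first deployment of a CSV.
# Usage: ephemeral_storage_patch <size>, for example 800Mi
ephemeral_storage_patch() {
  printf '[{"op":"replace","path":"%s","value":"%s"}]\n' \
    "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/ephemeral-storage" \
    "$1"
}
```

The output can be passed to the patch command as oc patch csv "$CSV" --type=json -p "$(ephemeral_storage_patch 800Mi)".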

CP4BA operators cannot connect to the OCP API endpoint

If you see the following error on your cluster, set the sc_api_namespace parameter (the namespaces for the API server) in the sc_egress_configuration section of the CP4BA custom resource to "{}".

"stderr": "Error from server (InternalError): an error on the server 
(\"dial tcp XXX.XX.X.X:443: i/o timeout\") has prevented the request from succeeding", 
"stderr_lines": ["Error from server (InternalError): an error on the server 
(\"dial tcp XXX.XX.X.X:443: i/o timeout\") has prevented the request from succeeding"]

For more information, see Shared configuration parameters.

Troubleshooting capabilities

The custom resource can be configured to enable and disable specific logging parameters, log levels, log formats, and where these logs are stored for the various capabilities. If you need more information about specific Cloud Pak capabilities, go to the relevant troubleshooting topics.