Troubleshooting
You must collect specific information about your deployments to assess their health and to diagnose a problem before you open a support case for Cloud Pak for Business Automation.
Before you begin
You must collect specific data about your environment and your Cloud Pak installation before you contact IBM support for assistance with a Cloud Pak for Business Automation issue. You must provide a detailed description of the problem and your environment.
When you run diagnostic commands, run them from an empty directory to package the files more cleanly. Run the commands from the namespace in which you observe the problematic container or component. For more information, see Collecting data to diagnose issues.
The OpenShift must-gather CLI command collects information from your cluster, which can be used to debug issues. You can specify one or more images when you run the command by including the --image argument. When you specify an image, the tool collects data that is related to that image.
A must-gather extension image for all IBM Cloud Paks is also available at: opencloudio/must-gather.
You can collect logs by running the following command:
oc adm must-gather --image=quay.io/opencloudio/must-gather:latest
For more information about collecting the logs, see Collecting support information about the cluster.
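The must-gather run can be wrapped in a short script so that its output lands in a single directory that is easy to archive and attach to a support case. A minimal sketch, assuming an active oc login session; the directory name is an arbitrary example, and --dest-dir is a standard option of oc adm must-gather:

```shell
# Collect must-gather output into a dated directory for easier packaging.
# Assumes you are logged in to the cluster; the directory name is an example.
dest="cp4ba-must-gather-$(date +%Y%m%d-%H%M)"
mkdir -p "$dest"
oc adm must-gather --image=quay.io/opencloudio/must-gather:latest --dest-dir="$dest"
# Package the directory for upload to the support case.
tar -czf "$dest.tar.gz" "$dest"
echo "Support archive: $dest.tar.gz"
```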
About this task
The ibm-cp4a-operator locates the Cloud Pak base images and has Ansible® roles to handle the reconciliation logic and declare a set of playbook tasks for each component. The roles declare all the variables and defaults for how the role is executed.
The operator deployment creates a container on your cluster for the operator. The following diagram shows how the operator watches for events, triggers an Ansible role when a custom resource changes, and then reconciles the resources for the deployed applications.
Depending on the type of operator, different logs are more useful. Use the following table to determine whether an operator produces Ansible or Go logs.
Capability | Type of operator | Operator name
---|---|---
CP4BA (multi-pattern) | Ansible | ibm-cp4a-operator
CP4BA FileNet Content Manager | Ansible | ibm-content-operator
Automation Foundation | Ansible | ibm-foundation-operator
CP4BA Workflow Process Server | Go | ibm-cp4a-wfps-operator
CP4BA Process Federation Server | Go | ibm-cp4a-pfs-operator
- Getting the logs of the Go-based operators
- To get the log for Go-based operators, run the following command:
kubectl logs deployment/$operator_name -n $project_name > operator.log
- Getting the logs of the Ansible-based operators
- To get the log of the latest reconciliation for Ansible-based operators, run the
following command:
# <Must set> Set your project name here
export project_name=$your_project_name
# <Must set> Set target operator name here
export operator_name=$operator_name
operator_pod_name=$(kubectl get pod | grep $operator_name | awk '{print $1}')
kubectl exec -i $operator_pod_name -n $project_name -- /bin/bash -c 'cat /tmp/ansible-operator/runner/icp4a.ibm.com/v1/*/*/*/artifacts/latest/stdout' > operator-ansible.log
Optional: Export the history of the Ansible logs.
Ansible operators keep a backup of the logs under /logs/$operator_pod_name/ansible-operator/runner/<group>/<version>/<kind>/<namespace>/<name>/artifacts. The log contains information on the first 10 reconciles, including the latest reconcile. The following commands copy the logs to a local directory. Select the operator name for which you want to export the log.
# <Must set> Set your project name here
export project_name=$your_project_name
export deployment_name=$(kubectl get icp4acluster | awk '{print $1}' | grep -v "NAME")

# Export the CP4BA Operator's Ansible log to /tmp/$operator_pod_name-log. Not needed when you install from the Content Operator.
export operator_pod_name=$(kubectl get pod | grep ibm-cp4a-operator | awk '{print $1}')
kubectl cp $project_name/$operator_pod_name:/logs/$operator_pod_name/ansible-operator/runner/icp4a.ibm.com/v1/ICP4ACluster/$project_name/$deployment_name/artifacts /tmp/$operator_pod_name-log

# Export the Content Operator's Ansible log to /tmp/$operator_pod_name-log. Only needed when the Content pattern is involved.
export operator_pod_name=$(kubectl get pod | grep ibm-content-operator | awk '{print $1}')
kubectl cp $project_name/$operator_pod_name:/logs/$operator_pod_name/ansible-operator/runner/icp4a.ibm.com/v1/Content/$project_name/$deployment_name/artifacts /tmp/$operator_pod_name-log

# Export the Foundation Operator's Ansible log to /tmp/$operator_pod_name-log. Not needed when you install from the CP4BA Operator.
export operator_pod_name=$(kubectl get pod | grep icp4a-foundation-operator | awk '{print $1}')
kubectl cp $project_name/$operator_pod_name:/logs/$operator_pod_name/ansible-operator/runner/icp4a.ibm.com/v1/Foundation/$project_name/$deployment_name/artifacts /tmp/$operator_pod_name-log
Note: If you see "Cannot stat: No such file or directory" when you export the Ansible logs, it means that either no log was generated by the current operator or the current operator is in its first reconcile.
Optional: Edit the verbosity of the Ansible logs.
If the operator log does not provide the level of detail that you need, you can gather more details by adding an annotation like the following example to your custom resource YAML:
metadata:
  ...
  annotations:
    ansible.sdk.operatorframework.io/verbosity: "3"
spec:
For the verbosity value, the normal rules for Ansible verbosity apply, where higher values mean more output. Acceptable values range from 0 (only the most severe messages are output) to 7 (all debugging messages are output). After you update the custom resource YAML, reapply the YAML for the changes to take effect.
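As an alternative to editing the YAML by hand, the annotation can be applied with kubectl annotate. A sketch, assuming an ICP4ACluster custom resource; the resource name and namespace are example values to replace with your own:

```shell
# Set Ansible verbosity to 3 on the custom resource (names are example values).
cr_name="icp4adeploy"
namespace="my-cp4ba-project"
kubectl annotate icp4acluster "$cr_name" -n "$namespace" \
  ansible.sdk.operatorframework.io/verbosity="3" --overwrite
echo "Requested verbosity 3 for icp4acluster/$cr_name in $namespace"
```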
- Getting information about pending pods
- If some pods are pending, choose one of the pods, and run the following command to get more
information.
kubectl describe pod <podname>
- Getting information about secrets
- Kubernetes secrets are used extensively, so output about them might also be
useful.
kubectl get secrets
- Getting information about events
- Kubernetes events are objects that provide more insight into what is happening inside a cluster,
such as what decisions the scheduler makes or why some pods are evicted from a node. To get
information about these events, run the following command.
kubectl get events > events.log
You can also add the verbose parameter to any kubectl command.
kubectl -v=9 get pods
- Enabling Liberty tracing for Liberty-based CP4BA pods
-
For FNCM, BAN, and ADP pods, use the following steps to enable a WebSphere Application Server (WAS) Liberty logging trace specification:
- Create a custom_server.xml file with a custom Liberty trace specification. A Liberty trace specification can vary and depends on why you are enabling it. The specification might come from IBM support or Liberty support.
- Copy the custom_server.xml file into the target pod under the
/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides directory. This
directory is mapped to a PVC where the configuration file can be persisted.
WAS Liberty immediately detects the server configuration file and creates a trace.log file in the default directory, or in a custom directory if you specified one in the custom_server.xml file.
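As an illustration, a custom_server.xml file is an ordinary Liberty configuration overlay. The following sketch shows the general shape; the traceSpecification string is a placeholder example only, because the real value normally comes from IBM or Liberty support:

```xml
<server>
    <!-- Placeholder trace specification: replace with the string provided by support. -->
    <logging traceSpecification="*=info:com.ibm.websphere.security.*=all"
             maxFileSize="100"
             maxFiles="10"
             traceFormat="BASIC"/>
</server>
```

You can then copy the file into the pod with a command such as oc cp custom_server.xml <pod>:/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/custom_server.xml.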
- Recreating the image pull secret
- If your Docker registry secret expires, you can delete the secret and re-create it:
oc delete secret admin.registrykey -n <namespace>
oc create secret docker-registry admin.registrykey --docker-server=image-registry.openshift-image-registry.svc:5000 --docker-username=kubeadmin --docker-password=$(oc whoami -t)
- Applying changes by restarting pods
- In some cases, changes that you make in the custom resource YAML by using the operator or
directly in the environment are not automatically propagated to all pods. For example, modifications
to data source information or changes to Kubernetes secrets are not seen by running pods until the
pods are restarted.
If changes that are applied by the operator or other modifications that are made in the environment do not provide the expected result, restart the pods by scaling the impacted deployments down to 0 and then back up to the number that you want, so that Kubernetes (OpenShift) terminates the existing pods and creates new ones.
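The scale-down and scale-up restart can be sketched as follows; the deployment name and replica count are example values to adapt to your environment:

```shell
# Restart a deployment's pods by scaling to 0 and back (example values).
deployment="icp4adeploy-navigator-deploy"   # hypothetical deployment name
replicas=2                                  # the replica count you want afterward
oc scale deployment "$deployment" --replicas=0
oc scale deployment "$deployment" --replicas="$replicas"
echo "Requested restart of $deployment with $replicas replicas"
```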
- Directory mount failure prevents pod readiness
If a pod stays in a CreateContainerError state, and the description of the problem includes text that is similar to the following message, remove the failing mounted path.
Warning Failed 43m kubelet Error: container create failed: time="2021-03-03T07:26:47Z" level=warning msg="unable to terminate initProcess" error="exit status 1" time="2021-03-03T07:26:47Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: rootfs_linux.go:60: mounting \"/var/lib/kubelet/pods/473b091d-acff-437b-b568-2383604dac01/volume-subpaths/config-volume/icp4adeploy-cmis-deploy/3\" to rootfs at \"/var/lib/containers/storage/overlay/d011608f6df4bbfcc26c7d60568915caf7932124e61924b1a75802e6884ea060/merged/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml\" caused: not a directory"
The problem occurs when a folder is generated instead of an XML file: an empty folder is created at the path where the file is mounted into the deployment, which raises the error.
You can remove a problematic folder from a deployment in two ways:
- If you can access the persistent volume, go to the mounted path and delete it. You can get the
path to the folder by running the following command.
oc describe pv $pv_name
- If you cannot access the persistent volume, edit the deployment by removing the failed mount.
- Edit the deployment by running the oc edit deployment <deployment_name> command. The following lines show an example mountPath:
  - mountPath: /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
    name: config-volume
    subPath: ibm_oidc_rp.xml
- You can then access the pod when it is Running by using the oc exec -it command.
oc exec -it icp4adeploy-cmis-deploy-5cd4774f78-mg6pw bash
- Delete the file with the rm command.
rm /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
When the folder is removed, you can wait for the operator to reconcile the change or add the removed mount path back manually to fix it.
- Cannot log in to the Zen console
- After installation, you might not be able to log in to the Zen console by using the default cluster administrator admin user name. The cause of this problem is that the name admin also exists in the LDAP directory. To resolve the login issue, use the following steps.
  - Change the name of the admin user in the platform-auth-idp-credentials secret.
  - Change the cluster-wide role binding oidc-admin-binding to the new admin user name.
  - Log in to the OpenShift console by using the new admin user name.
  - Add any new users that you need in the console.
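For the secret change, the user name can be updated with a patch. This is only a sketch: the admin_username key and the ibm-common-services namespace are assumptions to verify against your environment (for example with oc get secret platform-auth-idp-credentials -o yaml) before you apply anything:

```shell
# Patch the admin user name in the credentials secret.
# The admin_username key and the namespace are assumptions; verify them first.
new_admin="cpadmin"   # example replacement user name
oc patch secret platform-auth-idp-credentials -n ibm-common-services \
  --type=merge -p "{\"stringData\":{\"admin_username\":\"$new_admin\"}}"
echo "Requested admin user name change to $new_admin"
```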
- Zen issues with NGINX configuration
If you see a "3.5.0.0 (xxxxxxxxxx)" message instead of the user interface when you try to access a component like ACCE or Navigator, use the following workaround to restart the pods for both the IBM NGINX and the Zen Watcher components:
- Delete the IBM NGINX pod by running the following command. Replace $namespace
with the name of your target
project.
oc delete po -l component=ibm-nginx -n $namespace
The names of the deleted pods are returned:
pod "ibm-nginx-6d958c8cd6-dhllb" deleted pod "ibm-nginx-6d958c8cd6-n9qqh" deleted
- Delete the Zen Watcher pod to restart
it:
oc delete po -l component=zen-watcher -n $namespace
The name of the deleted pod is returned:
pod "zen-watcher-6c89d9fc7c-qw7rm" deleted
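After both deletions, it can be worth confirming that the replacement pods reach the Running state before you retry the component. A sketch, with the namespace as an example value:

```shell
# Check that the restarted pods come back up (namespace is an example value).
namespace="my-cp4ba-project"
oc get po -l component=ibm-nginx -n "$namespace"
oc get po -l component=zen-watcher -n "$namespace"
echo "Checked ibm-nginx and zen-watcher pods in $namespace"
```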
- Platform UI (Zen) becomes corrupted if the ZenService is deleted without uninstalling Cloud Pak for Business Automation
- If Zen is corrupted, uninstall Cloud Pak for Business Automation, delete the Zen associated PVs, and reinstall. The following errors are symptoms of a corrupted Zen.
  - The roles and user role mappings are lost when the Zen PVs are removed.
  - You might see "<no data>" in some UIs if the translation data is missing because the zen-translation jobs did not run.
For more information about uninstalling Cloud Pak for Business Automation, see Uninstalling capabilities.
- Issues trying to install after you uninstalled
- If you see issues when you install a new instance on a cluster that you already used for a Cloud
Pak deployment, check if the IBM Automation Foundation dependencies are properly deleted.
For more information, see Uninstallation does not remove all components.
- Profile size does not scale down
- When you decrease the pattern profile size after installation, from large to medium or from medium to small, IBM Automation Foundation and IBM Cloud Pak foundational services do not scale down with the profile size change. This behavior is expected. For more information about profile sizes, see System requirements.
- Operator pod in OOMKilled status
If you see the Cloud Pak for Business Automation operator pod with an OOMKilled status, the resources that are allocated to the operator pod are not enough for the workload. You can modify the csv to give the operator more resources. The following example can be adjusted to get the operator pod up and running again.
oc patch csv ibm-cp4a-operator.v22.1.0 --type=json -p '[
  { "op":"replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/cpu", "value": "4" },
  { "op":"replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory", "value": "8Gi" },
  { "op":"replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/cpu", "value": "1500m" },
  { "op":"replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/memory", "value": "1600Mi" }
]'
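After patching, you can confirm the new values directly on the CSV. A sketch, where the CSV name and version mirror the example above and must match your installed version:

```shell
# Inspect the operator container's resource settings on the CSV (name is an example).
csv="ibm-cp4a-operator.v22.1.0"
oc get csv "$csv" -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].resources}'
echo ""
echo "Inspected resources for $csv"
```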
- Nginx deployment failed due to leftover Zen resources in cluster
-
If a Cloud Pak for Business Automation deployment shows a CrashLoopBackOff message for Nginx, an Nginx configuration was not cleaned up properly and the Nginx pods cannot start (they stay in CrashLoopBackOff).
oc get deploy | grep nginx
ibm-nginx                           0/2   2   0   12h
ibm-nginx-tester                    0/1   1   0   12h

oc get po | grep nginx
ibm-nginx-568667548b-6n4cw          0/1   CrashLoopBackOff   147   12h
ibm-nginx-568667548b-q9d8r          0/1   CrashLoopBackOff   147   12h
ibm-nginx-tester-684f8f9844-p6gp5   0/1   CrashLoopBackOff   147   12h
setup-nginx-job-nqzgd               0/1   Completed          0     12h
To work around the problem, you must make sure that all the CP4BA generated .conf files are deleted from the PV of Nginx.
To delete the configuration files, use the following steps:
- Save the following template to a remove-zen-extension-pod.yaml file.
kind: Pod
apiVersion: v1
metadata:
  name: remove-zen-extension-pod
spec:
  containers:
    - name: remove-zen-extension-pod
      image: busybox
      securityContext:
        privileged: true
        runAsUser: 0
      volumeMounts:
        - mountPath: "/data"
          name: my-volume
      command: [ "sleep", "1000000" ]
  volumes:
    - name: my-volume
      persistentVolumeClaim:
        claimName: user-home-pvc
- Run the oc apply -f remove-zen-extension-pod.yaml command to create the pod.
- Make sure that the remove-zen-extension-pod is up and running, and then log in to the pod by running the oc rsh remove-zen-extension-pod command.
- In the pod, delete all of the CP4BA generated .conf files under both "/data/_global_/upstream-conf.d/" and "/data/_global_/nginx-conf.d/".
METANAME="icp4adeploy"
rm -rf /data/_global_/upstream-conf.d/${METANAME}*
rm -rf /data/_global_/nginx-conf.d/${METANAME}*
Where METANAME is the value of the metadata.name parameter in the custom resource of your CP4BA deployment. The default name is icp4adeploy.
- Restart the Nginx pods that showed the CrashLoopBackOff error, and when the new Nginx pods are up and running, delete the remove-zen-extension-pod by running the oc delete pod remove-zen-extension-pod command.
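Because the rm -rf commands are destructive, it can help to preview the matching files before deleting anything. A sketch that only lists what would be removed; DATA_DIR is a hypothetical override so the same script can be tried outside the pod, and METANAME must match your CR's metadata.name:

```shell
# Preview the CP4BA-generated .conf files before deleting them.
METANAME="icp4adeploy"            # metadata.name of your CP4BA custom resource
DATA_DIR="${DATA_DIR:-/data}"     # hypothetical override; defaults to the pod mount path
for d in "$DATA_DIR/_global_/upstream-conf.d" "$DATA_DIR/_global_/nginx-conf.d"; do
  echo "Matches in $d:"
  ls "$d/$METANAME"* 2>/dev/null || echo "  (none found)"
done
```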
What to do next
The custom resource can be configured to enable and disable specific logging parameters, log levels, log formats, and where these logs are stored for the various capabilities. If you need more information about specific Cloud Pak capabilities, go to the relevant troubleshooting topics.