Troubleshooting

If your "starter" deployment does not work as you expect, check the following known issues and try the suggested mitigations or workarounds.

BTS operator stuck because the OIDC Client secret is not deployed

If you see the following message repeatedly in the logs of the Business Teams Service operator (ibm-bts-operator-controller-manager pod):

INFO controllers.BusinessTeamsService 
IAM Client secret ibm-bts-oidc-client-secret not yet found, retry after 5 seconds...

And the OIDC Client custom resource (CR) shows the Ready condition with the status False:

status:
  conditions:
    - lastTransitionTime: '2023-10-17T18:09:02Z'
      message: OIDC client registration create failed
      reason: CreateClientFailed
      status: 'False'
      type: Ready

Then take the following actions:

  1. Restart the pods of the Identity Management (IM) service (platform-auth-service).
    oc get pod
    oc delete pod platform-auth-service-xxxxxxxxxx-xxxxx
  2. Wait for the pods to be ready.
  3. Trigger a reconcile of the OIDC Client.

    You can trigger the reconcile by adding a dummy label, for example by running the following command.

    oc patch client ibm-bts-oidc-client --patch '{"metadata": {"labels": {"dummyLabel": "dummyLabel"}}}' --type=merge

    After a couple of minutes, the secret ibm-bts-oidc-client-secret appears and the Business Teams Service pods are created.
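
The two checks in this section, the Ready condition of the OIDC Client and the appearance of the secret, can be scripted. The following is a minimal sketch, assuming that oc is logged in to the project that contains these resources. The function names, the retry count, and the delay are illustrative and are not part of the product.

```shell
# Print the status of the Ready condition of the OIDC Client CR
# (True or False). The client name defaults to ibm-bts-oidc-client.
oidc_client_ready_status() {
  oc get client "${1:-ibm-bts-oidc-client}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Poll until a secret exists. Arguments: secret name, maximum number of
# attempts (default 30), delay in seconds between attempts (default 10).
wait_for_secret() {
  wfs_name="$1"
  wfs_tries="${2:-30}"
  wfs_delay="${3:-10}"
  wfs_i=0
  while [ "$wfs_i" -lt "$wfs_tries" ]; do
    if oc get secret "$wfs_name" >/dev/null 2>&1; then
      echo "secret $wfs_name found"
      return 0
    fi
    wfs_i=$((wfs_i + 1))
    sleep "$wfs_delay"
  done
  echo "secret $wfs_name not found after $wfs_tries attempts" >&2
  return 1
}

# Example use after step 3:
#   oidc_client_ready_status
#   wait_for_secret ibm-bts-oidc-client-secret
```

With the default values, the polling gives up after about five minutes, which leaves room for the "couple of minutes" that the secret normally needs to appear.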

PostgreSQL cannot read the SSL key file

An intermittent problem with reading the PostgreSQL SSL key can occur in any of the Cloud Pak for Business Automation pods. The following message indicates the problem:

SEVERE: CWLPS1105: An error occurred while call the db initialize: org.postgresql.util.PSQLException: Could not read SSL key file /client-cert/tls.key

If you see the message in the logs, take the following actions:

  1. Delete the secret icp4adeploy-pg-client-cert-secret.
  2. Delete the configMap icp4adeploy-prereq-config.

The CP4BA operator then re-creates the secret and the configMap, which normally resolves the issue.
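
The check and the two deletions can be combined in a small script. This is a sketch only: it assumes that oc is logged in to the project of the deployment, and the function names are hypothetical. The secret and configMap names are the ones listed in the steps above.

```shell
# Succeed if the log text on stdin contains the SSL key error.
has_ssl_key_error() {
  grep -q 'Could not read SSL key file'
}

# Delete the secret and the configMap so that the operator re-creates them.
reset_pg_client_cert() {
  oc delete secret icp4adeploy-pg-client-cert-secret &&
    oc delete configmap icp4adeploy-prereq-config
}

# Example use, where <pod_name> is the pod that logs the error:
#   oc logs <pod_name> | has_ssl_key_error && reset_pg_client_cert
```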

Directory mount failure prevents pod readiness

If a pod stays in the CreateContainerError state and the description of the problem includes text similar to the following message, remove the problematic mounted path.

Warning  Failed  43m  kubelet  Error: container create failed: time="2021-03-03T07:26:47Z" level=warning msg="unable to terminate initProcess" error="exit status 1"
time="2021-03-03T07:26:47Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: rootfs_linux.go:60: mounting \"/var/lib/kubelet/pods/473b091d-acff-437b-b568-2383604dac01/volume-subpaths/config-volume/icp4adeploy-cmis-deploy/3\" to rootfs at \"/var/lib/containers/storage/overlay/d011608f6df4bbfcc26c7d60568915caf7932124e61924b1a75802e6884ea060/merged/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml\" caused: not a directory"

The problem occurs when a folder is generated instead of an XML file. The empty folder is then used to mount the file to the deployment, which raises the "not a directory" error.
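
To find the affected pods quickly, the pod list can be filtered on the status column. This is a minimal sketch, assuming the default column layout of oc get pods (NAME, READY, STATUS, ...). The function name is hypothetical.

```shell
# Read `oc get pods --no-headers` output on stdin and print the names of
# the pods whose STATUS column (third field) is CreateContainerError.
pods_in_create_container_error() {
  awk '$3 == "CreateContainerError" { print $1 }'
}

# Example use:
#   oc get pods --no-headers | pods_in_create_container_error
#   oc describe pod <pod_name>   # look for "caused: not a directory"
```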

You can remove a problematic folder from a deployment in two ways:

  • If you can access the persistent volume, go to the mounted path and delete the folder. You can get the path to the folder by running the following command.
    oc describe pv $pv_name
  • If you cannot access the persistent volume, remove the failed mount from the deployment.
    1. Run the oc edit deployment <deployment_name> command and remove the entry for the failed mount. The following lines show an example mountPath:
      - mountPath: /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
        name: config-volume
        subPath: ibm_oidc_rp.xml
    2. When the pod is in the Running state, access it by using the oc exec -it command.
      oc exec -it icp4adeploy-cmis-deploy-5cd4774f78-mg6pw -- bash
    3. Delete the file with the rm command.
      rm /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml

When the folder is removed, you can wait for the operator to reconcile the change or add the removed mount path back manually to fix it.

Cluster admin setup script issues

The CRD can fail to deploy while the cp4a-clusteradmin-setup.sh script runs. If the output contains the following message, the user ('XYZ' in the example) does not have cluster-admin permission:

Start to create CRD, service account and role ...
Error from server (Forbidden): error when retrieving current configuration of: "/root/git/cert-kubernetes/descriptors/ibm_cp4a_crd.yaml": 
customresourcedefinitions.apiextensions.k8s.io "icp4aclusters.icp4a.ibm.com" is forbidden: 
User "XYZ" cannot get customresourcedefinitions.apiextensions.k8s.io at the cluster scope: 
no RBAC policy matched

To resolve the problem, log in as a cluster admin user:

  1. Log out of the current session (non-admin).
  2. Log in to OCP as the cluster admin user by using the OpenShift CLI:
    oc login -u dbaadmin

    Where dbaadmin is the cluster admin user.
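
Before you run the script, you can check whether the logged-in user has the permission that the error message refers to. The following sketch uses oc auth can-i, and the function name is hypothetical. It checks only access to CRDs, not full cluster-admin rights, so a positive result is a necessary condition rather than a guarantee.

```shell
# Succeed if the current user is allowed to get and create CRDs.
# This covers only the permission named in the error message.
can_manage_crds() {
  oc auth can-i get customresourcedefinitions >/dev/null 2>&1 &&
    oc auth can-i create customresourcedefinitions >/dev/null 2>&1
}

# Example use before running cp4a-clusteradmin-setup.sh:
#   oc whoami
#   can_manage_crds || echo "log in as a cluster admin user first"
```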

Case init job failure

  • If the Case init job restarts several times but keeps failing, complete the following steps.
    1. Check the logs of the Case init job pod by running a command similar to the following command:
      oc logs --previous <case_init_job_pod>

      If the output contains the following error, the Case init job is running into a Content Platform Engine timeout.

      CPE_URL=https://bawps-cpe-svc:9443/wsi/FNCEWS40MTOM
      Certificate was added to keystore
      log4j:WARN No appenders could be found for logger (filenet_error.api.com.filenet.apiimpl.util.ConfigValueLookup).
      log4j:WARN Please initialize the log4j system properly.
      CPE URI :https://bawps-cpe-svc:9443/wsi/FNCEWS40MTOM
      [Perf Log] No interval found. Auditor disabled.
      P8DOMAIN
      starting setup DOS and TOS
      executing setupTOS
      java.lang.RuntimeException: The case management add-ons cannot be installed in Content Engine. The installation of the AddOn 20.0.0.1 Case Management Target Object Store Extensions into the object store TARGET failed. The installation report follows: <ImportErrors><ClassDefinitions><ReplicableClassDefinition><Id>6d18ffeb-7be8-41ac-9322-38a72743a10d</Id><Name>Health Condition</Name><ExceptionMessage>The database access failed with the following error: ErrorCode 0, Message 'addSync: caught Exception' ObjectStore: "TARGET", SQL: "SELECT security_id FROM OS2USER.TableDefinition WHERE (object_id = ?)"</ExceptionMessage><ExceptionCode>DB_ERROR</ExceptionCode><HRESULT>0x800710d9</HRESULT></ReplicableClassDefinition></ClassDefinitions></ImportErrors>
      The case management add-ons cannot be installed in Content Engine. The installation of the AddOn 20.0.0.1 Case Management Target Object Store Extensions into the object store TARGET failed. The installation report follows: <ImportErrors><ClassDefinitions><ReplicableClassDefinition><Id>6d18ffeb-7be8-41ac-9322-38a72743a10d</Id><Name>Health Condition</Name><ExceptionMessage>The database access failed with the following error: ErrorCode 0, Message 'addSync: caught Exception' ObjectStore: "TARGET", SQL: "SELECT security_id FROM OS2USER.TableDefinition WHERE (object_id = ?)"</ExceptionMessage><ExceptionCode>DB_ERROR</ExceptionCode><HRESULT>0x800710d9</HRESULT></ReplicableClassDefinition></ClassDefinitions></ImportErrors>
      java.lang.RuntimeException: The case management add-ons cannot be installed in Content Engine. The installation of the AddOn 20.0.0.1 Case Management Target Object Store Extensions into the object store TARGET failed. The installation report follows: <ImportErrors><ClassDefinitions><ReplicableClassDefinition><Id>6d18ffeb-7be8-41ac-9322-38a72743a10d</Id><Name>Health Condition</Name><ExceptionMessage>The database access failed with the following error: ErrorCode 0, Message 'addSync: caught Exception' ObjectStore: "TARGET", SQL: "SELECT security_id FROM OS2USER.TableDefinition WHERE (object_id = ?)"</ExceptionMessage><ExceptionCode>DB_ERROR</ExceptionCode><HRESULT>0x800710d9</HRESULT></ReplicableClassDefinition></ClassDefinitions></ImportErrors>
      at com.ibm.casemgmt.config.ContentEngineHelper.setUpCMTOS(ContentEngineHelper.java:1833)
      at com.ibm.ecm.icm.config.init.repository.ConfigureObjectStore.setupTOS(ConfigureObjectStore.java:99)
      at com.ibm.ecm.icm.config.init.test.ConfigureContentEngine.installAddons(ConfigureContentEngine.java:48)
      at com.ibm.ecm.icm.config.init.test.InitCaseManager.main(InitCaseManager.java:19)

      The timeout might be caused by network problems, for example because the database is in a remote location, or by database access problems. Check the database logs for the source of the problem.

    2. Create a z.xml Liberty configuration file to override the timeout, with the following content:
      <server>
      <transaction clientInactivityTimeout="1800s" propogatedOrBMTTranLifetimeTimeout="1800s" totalTranLifetimeTimeout="1800s"/>
      </server>

      Then run:

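      # Find the Content Platform Engine pod (assumes that exactly one pod matches).
      podname=$(oc get pod | grep cpe-deploy | awk '{print $1}')
      echo "$podname"
      # Copy the override file into the Liberty configDropins/overrides directory.
      oc cp z.xml "$podname":/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides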
    3. Restart the icp4adeploy-cpe-deploy-nnn pod. You might also need to restart the icp4adeploy-bawins1-baw-case-init-job-nnnnnn pod.
    4. If the Case init job stops generating new pods, delete the Case init job and let the operator re-create it.
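
Steps 2 and 3 can also be sketched as two small helper functions, for example if you want to try a timeout value other than 1800 seconds. The function names are hypothetical, and the example assumes that oc is logged in to the project of the deployment. The attribute names are copied unchanged from the z.xml content in step 2.

```shell
# Print a Liberty override with the three transaction timeouts set to the
# given number of seconds (1800 by default, as in step 2).
write_tx_timeout_override() {
  tx_timeout="${1:-1800}s"
  cat <<EOF
<server>
<transaction clientInactivityTimeout="$tx_timeout" propogatedOrBMTTranLifetimeTimeout="$tx_timeout" totalTranLifetimeTimeout="$tx_timeout"/>
</server>
EOF
}

# Read `oc get pod` output on stdin and print the first cpe-deploy pod name.
first_cpe_pod() {
  awk '/cpe-deploy/ { print $1; exit }'
}

# Example use for steps 2 and 3:
#   write_tx_timeout_override 1800 > z.xml
#   podname=$(oc get pod | first_cpe_pod)
#   oc cp z.xml "$podname":/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides
#   oc delete pod "$podname"   # assumption: deleting the pod is how it is restarted
```

Printing only the first match avoids passing several pod names to oc cp when more than one Content Platform Engine pod is running. In that case, repeat the copy for each pod.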