Troubleshooting

If your "demo" deployment is not working as you expect, check out the listed issues and try the mitigation or workarounds.

The troubleshooting information is divided into the following sections:

Directory mount failure prevents pod readiness

If a pod stays in a CreateContainerError state and the pod description includes text similar to the following message, remove the problematic mounted path.

Warning  Failed  43m  kubelet  Error: container create failed: time="2021-03-03T07:26:47Z" level=warning msg="unable to terminate initProcess" error="exit status 1"
time="2021-03-03T07:26:47Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: rootfs_linux.go:60: mounting \"/var/lib/kubelet/pods/473b091d-acff-437b-b568-2383604dac01/volume-subpaths/config-volume/icp4adeploy-cmis-deploy/3\" to rootfs at **\"/var/lib/containers/storage/overlay/d011608f6df4bbfcc26c7d60568915caf7932124e61924b1a75802e6884ea060/merged/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml\" caused: not a directory"**

The problem occurs when a folder is generated instead of an XML file. An empty folder is created at the mount path instead of the expected file, which causes the mount to fail.

You can remove a problematic folder from a deployment in two ways:

  • If you can access the persistent volume, go to the mounted path and delete the folder. You can get the path to the folder by running the following command.
    oc describe pv $pv_name
  • If you cannot access the persistent volume, edit the deployment by removing the failed mount.
    1. Edit the deployment by running the oc edit deployment <deployment_name> command. The following lines show an example mountPath:
      - mountPath: /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
        name: config-volume
        subPath: ibm_oidc_rp.xml
    2. You can then access the pod when it is Running by using the oc exec -it command.
      oc exec -it icp4adeploy-cmis-deploy-5cd4774f78-mg6pw bash
    3. Delete the file with the rm command.
      rm /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml

After the folder is removed, you can wait for the operator to reconcile the change, or add the removed mount path back manually to fix it.
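If you use the first approach, the following sketch shows how you might confirm that the mounted path was created as a directory and remove it on the storage that backs the persistent volume. The path placeholder is for illustration only; use the values that oc describe pv returns for your volume.

# Find the backing storage for the volume (for example, an NFS server and path)
oc describe pv $pv_name | grep -A 3 "Source:"
# On the storage host, confirm that the path is a directory rather than a file,
# then remove it so that the operator can mount the file again
ls -ld <path_to_mounted_folder>/ibm_oidc_rp.xml
rmdir <path_to_mounted_folder>/ibm_oidc_rp.xml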

Cluster admin setup script issues

If the CRD fails to deploy during the execution of the cp4a-clusteradmin-setup.sh script, and the following message is seen in the output, the user ("XYZ" in the example) does not have cluster-admin permission:

Start to create CRD, service account and role ...
Error from server (Forbidden): error when retrieving current configuration of: "/root/git/cert-kubernetes/descriptors/ibm_cp4a_crd.yaml": 
customresourcedefinitions.apiextensions.k8s.io "icp4aclusters.icp4a.ibm.com" is forbidden: 
User "XYZ" cannot get customresourcedefinitions.apiextensions.k8s.io at the cluster scope: 
no RBAC policy matched
To resolve the issue:
  1. Log out of the current session (non-admin).
  2. Log in to OCP with the OCP cluster admin user. Using the OpenShift CLI:
    oc login -u dbaadmin

    Where dbaadmin is the cluster admin user.
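
    To confirm that the user you are now logged in as can manage custom resource definitions at the cluster scope, a quick check with oc auth can-i:
    oc whoami
    oc auth can-i create customresourcedefinitions

    If the second command prints no, the user still lacks the required cluster-admin permission.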

Db2 issues

In 21.0.1, Db2 is installed as part of the prerequisites of the patterns. The following issues can be resolved by matching the source of the problem with the proposed solution to make Db2 operational again.

Reconciler error because Db2 cannot create schema

If you try to install a second deployment on the same cluster, the installation might fail. The operator log might show a reconciler error if the new deployment tries to use a worker node where Db2 is already running. To resolve the issue, delete the first deployment.

Intermittent issue where Db2 process is not listening on port 50000

If the message "not listening on port 50000" appears in the logs, complete the following steps:

  1. Get the current running Db2 pod. Using the OpenShift CLI:
    oc get pod
  2. Go to the pod. Using the OpenShift CLI:
    oc exec -it <db2 pod> bash
  3. Switch to the db2inst1 user:
    su - db2inst1
  4. Reapply the configuration:
    db2 update dbm cfg using SVCENAME 50000
  5. Restart Db2:
    db2stop
    db2start
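
After Db2 restarts, you can verify the change from inside the pod as db2inst1. A minimal check; ss is an assumption about the tools available in the Db2 image, so substitute netstat if it is not present:
db2 get dbm cfg | grep -i svcename
ss -ltn | grep 50000
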
Db2 pod fails to start and the db2u-release-db2u-0 pod shows 0/1 Ready
This issue has the following symptoms in the Db2 pod logs:
[5357278.440940] db2u_root_entrypoint.sh[20]: + sudo /opt/ibm/db2/V11.5.0.0/adm/db2licm -a /db2u/license/db2u-lic
[5357278.531782] db2u_root_entrypoint.sh[20]: LIC1416N  The license could not be added automatically.  Return code: "-100".
[5357278.535893] db2u_root_entrypoint.sh[20]: + [[ 156 -ne 0 ]]
[5357278.536085] db2u_root_entrypoint.sh[20]: + echo '(*) Unable to apply db2 license.'
[5357278.536177] db2u_root_entrypoint.sh[20]: (*) Unable to apply db2 license.

To mitigate the issue, you have a number of options:

Option 1: Kill Db2

  1. Run the following command to get the worker node that db2u is running on. Using the OpenShift CLI:
    oc get nodes -o wide
  2. Run an ssh command as root on the worker node that hosts Db2u:
    ssh root@<worker node>
  3. Run the following command to kill the orphaned db2u semaphores (see the sketch after this procedure to identify any other orphaned semaphores):
    ipcrm -S 0x61a8
  4. Clean up the affected project/namespace:

    The following OCP CLI command gets the custom resource name:

    oc get icp4acluster

    Delete the custom resource:

    oc delete icp4acluster $name

    Where $name is the result from the previous command.

    Delete the operator deployment:

    oc delete deployment <operator-deployment-name>
  5. Run the deployment script to start again.
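
Step 3 removes the semaphore with key 0x61a8. If you are not sure which semaphores are orphaned on the node, a sketch for listing them before you remove anything:
# List all System V semaphores with their keys and owners
ipcs -s
# Remove a specific orphaned semaphore by key (repeat for each orphaned key)
ipcrm -S <semaphore_key>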

Option 2: Clean Db2 and redeploy

  1. Get the custom resource name for icp4acluster. Using the OpenShift CLI:
    oc get icp4acluster
  2. Delete the CR. Using the OpenShift CLI:
    oc delete icp4acluster $name
    or
    oc delete -f $cr.yaml
    The $cr.yaml file is generated in the ./tmp directory. You also need to delete the operator deployment by running the following OCP CLI command:
    oc delete deployment <operator-deployment-name>
  3. Make sure that nothing is left over by running the following OCP CLI commands (see the sketch after this procedure for removing leftover Db2 PVCs):
    oc get sts
    oc get jobs
    oc get deployment
    oc get pvc | grep db2
  4. Run the deployment script to start again.
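
If step 3 shows leftover Db2 persistent volume claims and you intend to discard that data, a sketch of how they can be removed in one pass (this permanently deletes the Db2 data from the old deployment):
oc get pvc | grep db2 | awk '{print $1}' | xargs oc delete pvc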

Option 3: Delete the project/namespace

If options 1 or 2 do not work, delete the project and redeploy by running the following OCP CLI command:

oc delete project $project_name

Option 4: Restart the entire cluster

  1. If none of the other options work, get the names of the nodes and restart them. Using the OpenShift CLI:
    oc get no --no-headers | awk '{print $1}'
  2. Restart all of the nodes listed (restart the worker nodes first, then the infrastructure node, and then the master node).
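
A sketch of how one node can be restarted cleanly before you move on to the next; the drain flags vary by oc version, and direct ssh access to the node as root is an assumption about your environment:
# Stop scheduling on the node and evict its pods
oc adm cordon <node-name>
oc adm drain <node-name> --ignore-daemonsets
# Reboot the node, then make it schedulable again after it rejoins the cluster
ssh root@<node-name> reboot
oc adm uncordon <node-name>
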
db2-release-db2u-restore-morph-job-xxxxx shows "Running" but never reaches "Completed"
Run the following OCP CLI command to check and confirm this issue:
oc get pod

The command outputs a table that shows the STATUS and READY columns:

NAME                                            READY        STATUS 
db2-release-db2u-restore-morph-job-xxxxx        1/1          Running

If the STATUS does not change to Completed after a few minutes, complete the following steps:

  1. Delete the Db2 pod by running the oc delete command:
    oc delete pod db2-release-db2u-restore-morph-job-xxxxx
  2. Confirm that the Db2 job is terminated and a new pod is up and running:
    oc get pod -w
    When the job reads Completed, the pattern can continue to deploy.
db2-release-db2u-restore-morph-job-xxxxx failed on bare metal nodes
If your deployment uses bare metal nodes on your ROKS cluster, make the following updates to work around the failing db2u-release-db2u-0 pod on a bare metal node.
  1. Get the node information on the cluster by running the following commands:
    oc get nodes 
    oc get nodes --show-labels 
    

    It is important to identify which nodes are bare metal (see the sketch after this procedure).

  2. To make sure that the bare metal nodes cannot be scheduled, run the following command for each bare metal node:
    oc adm cordon <node-name>
    
  3. Delete the db2morph job:
    oc delete job db2-release-db2u-restore-morph-job
    
  4. Delete the Db2 release pod:
    oc delete pod db2u-release-db2u-0
    
  5. Make sure that the new Db2 release pod moved to a non-bare metal node:
    oc get pods -o wide | grep db2u-release-db2u-0
    
  6. Delete the operator pod to recreate the morph job:
    oc get pods | grep ibm-cp4a-operator
    oc delete pod <operator-pod-name>
    

    After the morph job is created and the operator starts deploying the RR/UMS pods, you can make your bare metal nodes schedulable again.

  7. To make a bare metal node schedulable, run the following command:
    oc adm uncordon <node-name>
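
For step 1, the exact node label that marks a worker as bare metal depends on how your ROKS cluster was provisioned, so the following filter is only a starting point; if it returns nothing, review the full --show-labels output instead:
oc get nodes --show-labels | grep -i metal
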
    
db2-release-db2u pods cannot be accessed after deployment
Open the operator log to view the deployment progress.
oc logs <operator pod name> -c operator -n <project-name>
Search for the string "db2u-release-db2u-statefulset pod is ready" in the log. The log might show the status of the db2u-release-db2u-statefulset pod as RETRYING.
TASK [prerequisites : check if db2u-release-db2u-statefulset pod is ready] *****
task path: /opt/ansible/roles/prerequisites/tasks/db2/db2-deploy.yml:141
Monday 04 May 2020  23:29:45 +0000 (0:00:00.095)       0:01:09.060 ************
FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (35 retries left).
FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (34 retries left).
FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (33 retries left).
FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (32 retries left).
FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (3 retries left).
FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (2 retries left).
FAILED - RETRYING: check if db2u-release-db2u-statefulset pod is ready (1 retries left).

If you do see the RETRYING message, the shell script that runs inside the db2u pod is timing out. If the pods are not in a ready state after 20 to 25 minutes, delete Db2 and redeploy.
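Before you delete Db2 and redeploy, it can help to check why the pod never reached a ready state. A few standard checks, using the pod name from this deployment:
oc get pod db2u-release-db2u-0 -o wide
oc describe pod db2u-release-db2u-0
oc logs db2u-release-db2u-0 --tail=50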

Database issues after a cluster reboot
A cluster reboot can cause permission issues with Db2. Check ~/sqllib/security/db2chpw and ~/sqllib/security/db2ckpw in the Db2 pod for the -r-s--x--x permission. If the permissions are not set correctly, use the following instructions to fix them: https://www.ibm.com/support/pages/database-connection-fails-authentication-error-sql1639n.
Use the following commands:
# Open a shell in the Db2 pod
oc exec -it db2u-release-db2u-0 bash
# Disable high availability monitoring before Db2 maintenance
sudo wvcli system disable -m "Disable HA before Db2 maintenance"
# Stop Db2 as the instance owner, then return to the original user
su db2inst1
db2stop
exit
# Restore the setuid permissions (-r-s--x--x) on db2chpw and db2ckpw
cd /mnt/blumeta0/home/db2inst1/sqllib/security
chmod 4411 db2chpw db2ckpw
# Update the instance, then re-enable high availability monitoring
cd /opt/ibm/db2/V11.5.0.0/instance/
./db2iupdt db2inst1
sudo wvcli system enable -m "Enable HA after Db2 maintenance"

Afterward, delete the Content Platform Engine pod so that the cluster can recreate the pod.
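A sketch of how the Content Platform Engine pod can be located and deleted; the grep pattern is an assumption based on the component names used in this deployment, so confirm the pod name before you delete it:
oc get pods | grep cpe
oc delete pod <cpe-pod-name>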

Project database limit for the Document Processing pattern
The evaluation deployment for Document Processing includes one project database. This configuration supports the creation of only one Document Processing project.

Generated routes do not work

In some environments, route URLs contain the string "apps.". However, the cp4a-clusteradmin-setup.sh script returns the hostname of the infrastructure node without this string. If you entered the hostname in the cp4a-post-deployment.sh script in an environment that uses "apps.", the routes do not work.

Workaround: When you run the cp4a-post-deployment.sh script, add "apps." to the infrastructure hostname.

For example, if the cp4a-clusteradmin-setup.sh script outputs the infrastructure hostname as ocp-master.tec.uk.ibm.com, enter ocp-master.apps.tec.uk.ibm.com when you run the cp4a-post-deployment.sh script.

Tip: You can find the existing route URLs by running oc get route --all-namespaces, and extract the common pattern in the route hostnames.
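For example, the following command prints only the route hostnames, which makes the common domain suffix (including any apps. segment) easy to spot:
oc get route --all-namespaces -o jsonpath='{range .items[*]}{.spec.host}{"\n"}{end}'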

Case init job failure

  • If the case init job restarts several times but still fails, complete the following steps.
    1. Check the case init job pod logs by running a command similar to the following command:
      oc logs --previous <case-init-job-pod-name>
      If the result has the following error, the case init job is running into a Content Platform Engine timeout.
      CPE_URL=http://bawps-cpe-svc:9080/wsi/FNCEWS40MTOM
      Certificate was added to keystore
      log4j:WARN No appenders could be found for logger (filenet_error.api.com.filenet.apiimpl.util.ConfigValueLookup).
      log4j:WARN Please initialize the log4j system properly.
      CPE URI :http://bawps-cpe-svc:9080/wsi/FNCEWS40MTOM
      [Perf Log] No interval found. Auditor disabled.
      P8DOMAIN
      starting setup DOS and TOS
      executing setupTOS
      java.lang.RuntimeException: The case management add-ons cannot be installed in Content Engine. The installation of the AddOn 20.0.0.1 Case Management Target Object Store Extensions into the object store TARGET failed. The installation report follows: <ImportErrors><ClassDefinitions><ReplicableClassDefinition><Id>6d18ffeb-7be8-41ac-9322-38a72743a10d</Id><Name>Health Condition</Name><ExceptionMessage>The database access failed with the following error: ErrorCode 0, Message 'addSync: caught Exception' ObjectStore: "TARGET", SQL: "SELECT security_id FROM OS2USER.TableDefinition WHERE (object_id = ?)"</ExceptionMessage><ExceptionCode>DB_ERROR</ExceptionCode><HRESULT>0x800710d9</HRESULT></ReplicableClassDefinition></ClassDefinitions></ImportErrors>
      The case management add-ons cannot be installed in Content Engine. The installation of the AddOn 20.0.0.1 Case Management Target Object Store Extensions into the object store TARGET failed. The installation report follows: <ImportErrors><ClassDefinitions><ReplicableClassDefinition><Id>6d18ffeb-7be8-41ac-9322-38a72743a10d</Id><Name>Health Condition</Name><ExceptionMessage>The database access failed with the following error: ErrorCode 0, Message 'addSync: caught Exception' ObjectStore: "TARGET", SQL: "SELECT security_id FROM OS2USER.TableDefinition WHERE (object_id = ?)"</ExceptionMessage><ExceptionCode>DB_ERROR</ExceptionCode><HRESULT>0x800710d9</HRESULT></ReplicableClassDefinition></ClassDefinitions></ImportErrors>
      java.lang.RuntimeException: The case management add-ons cannot be installed in Content Engine. The installation of the AddOn 20.0.0.1 Case Management Target Object Store Extensions into the object store TARGET failed. The installation report follows: <ImportErrors><ClassDefinitions><ReplicableClassDefinition><Id>6d18ffeb-7be8-41ac-9322-38a72743a10d</Id><Name>Health Condition</Name><ExceptionMessage>The database access failed with the following error: ErrorCode 0, Message 'addSync: caught Exception' ObjectStore: "TARGET", SQL: "SELECT security_id FROM OS2USER.TableDefinition WHERE (object_id = ?)"</ExceptionMessage><ExceptionCode>DB_ERROR</ExceptionCode><HRESULT>0x800710d9</HRESULT></ReplicableClassDefinition></ClassDefinitions></ImportErrors>
      at com.ibm.casemgmt.config.ContentEngineHelper.setUpCMTOS(ContentEngineHelper.java:1833)
      at com.ibm.ecm.icm.config.init.repository.ConfigureObjectStore.setupTOS(ConfigureObjectStore.java:99)
      at com.ibm.ecm.icm.config.init.test.ConfigureContentEngine.installAddons(ConfigureContentEngine.java:48)
      at com.ibm.ecm.icm.config.init.test.InitCaseManager.main(InitCaseManager.java:19)
    2. Add a Liberty configuration file to override the timeout, with the following content:
      <server>
        <transaction clientInactivityTimeout="1800s" propogatedOrBMTTranLifetimeTimeout="1800s" totalTranLifetimeTimeout="1800s"/>
      </server>
      For more information, see Tuning IBM WebSphere® Liberty for FileNet® Content Manager components.
    3. If the case init job stops generating new pods, delete the case init job and let the operator re-create it.
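
A sketch for deleting the case init job manually; the grep pattern is an assumption, so confirm the actual job name in your project before you delete it. The operator re-creates the job on its next reconciliation:
oc get jobs | grep -i case
oc delete job <case-init-job-name>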