Troubleshooting

Before you open a support case for a container deployment, collect specific information about your deployments so that you can assess their health and diagnose the problem.

Collect specific data about your environment and your installation before you contact IBM® support for assistance with an issue. Always provide a detailed description of the problem and your environment.

When you run diagnostic commands, run them from an empty directory to package the files more cleanly. Run the commands from the namespace in which you observe the problematic container or component.
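
One way to follow this advice is to create a fresh working directory before you collect anything. The following is a minimal sketch; the function name and the cortex-diag prefix are assumptions chosen for illustration and are not part of the product.

```shell
# Create a new, empty working directory for diagnostic output and print its path.
# The "cortex-diag" prefix is an arbitrary, hypothetical name.
make_diag_dir() {
  mktemp -d "${TMPDIR:-/tmp}/cortex-diag.XXXXXX"
}

# Usage:
#   cd "$(make_diag_dir)"
#   kubectl config set-context --current --namespace=<namespace>
```

Setting the namespace on the current context means that later kubectl commands run against the namespace where you observe the problem.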

The ibm-content-operator locates the Content Cortex base images and has Ansible® roles to handle the reconciliation logic and declare a set of playbook tasks for each component. The roles declare all the variables and defaults for how the role is executed.

Use the following sections to find the information that you are looking for.

Must Gather
Collecting data
Resolving issues

Must Gather

You can use the must_gather_py script to gather information such as configmaps, pods, secrets, and log files about the Content Cortex resources in the target namespace.

The must_gather_py script uses the following utility tools and needs them to be installed on your client machine.

Kubernetes CLI
Python
ibm-content-cortex-containers repository from GitHub

To install the specified tools that are used by the must_gather.py script, see topic Preparing a client to connect to the cluster.

If the script finds that any of these tools are missing on the client, it reports which tools are missing and provides a choice to install the tool.

To run the script and to know more about the information the script collects, see technote Gathering deployment information and logs from Content Cortex.

Collecting data

Getting the logs of the Ansible-based operators

The Content operator is an Ansible-based operator. To get the log of the latest reconciliation for Ansible-based operators, run the following command:

# <Must set> Set your project name here
export project_name=$your_project_name 

# <Must set> Set target operator name here 
export operator_name=ibm-content-operator

operator_pod_name=$(kubectl get pod -n "$project_name" | grep "$operator_name" | awk '{print $1}')
kubectl exec -i "$operator_pod_name" -n "$project_name" -- /bin/bash -c 'cat /tmp/ansible-operator/runner/fncm.ibm.com/v1/*/*/*/artifacts/latest/stdout' > operator-ansible.log
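
The grep in the preceding commands can match no pod or more than one pod, in which case the kubectl exec call fails with a confusing error. The following sketch shows one way to guard against that. The function name single_pod_name is hypothetical, and the function only parses the text that kubectl get pod prints.

```shell
# Print the name of the single pod whose name contains $1.
# Reads `kubectl get pod --no-headers` style lines on stdin.
# Returns non-zero if zero pods or more than one pod match.
single_pod_name() {
  awk -v pat="$1" '
    index($1, pat) { name = $1; n++ }
    END {
      if (n != 1) {
        printf("expected 1 pod matching %s, found %d\n", pat, n) > "/dev/stderr"
        exit 1
      }
      print name
    }'
}

# Usage:
#   operator_pod_name=$(kubectl get pod -n "$project_name" --no-headers |
#     single_pod_name "$operator_name") || exit 1
```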

Optional: Export the history of the Ansible logs.

Ansible operators keep a backup of the logs under /logs/$operator_pod_name/ansible-operator/runner/<group>/<version>/<kind>/<namespace>/<name>/artifacts. The log contains information on the first 10 reconciles, including the latest reconcile. The following commands copy the logs to a local directory. Select the operator name for which you want to export the log.

# <Must set> Set your project name here 
export project_name=$your_project_name

# <Must set> Set target operator name here 
export operator_name=ibm-content-operator

export deployment_name=$(kubectl get fncmcluster -n "$project_name" | awk '{print $1}' | grep -v "NAME")

# Export the Content operator's Ansible logs to /tmp/$operator_pod_name-log. You do not need this step when you install from Content Operator.
export operator_pod_name=$(kubectl get pod -n "$project_name" | grep "$operator_name" | awk '{print $1}')
kubectl cp $project_name/$operator_pod_name:/logs/$operator_pod_name/ansible-operator/runner/fncm.ibm.com/v1/FNCMCluster/$project_name/$deployment_name/artifacts /tmp/$operator_pod_name-log

Note: If you see "Cannot stat: No such file or directory" when you export the Ansible logs, either the current operator has not generated any logs yet or the operator is still in its first reconcile.

Optional: Edit the verbosity of the Ansible logs.

If the operator log does not provide the level of detail that you need, you can gather more details by adding an annotation like the following example to your custom resource YAML:

metadata:
  ...
  annotations:
    ansible.sdk.operatorframework.io/verbosity: "3"
spec:

For the verbosity value, the normal rules for Ansible verbosity apply, where higher values mean more output. Acceptable values range from 0 (only the most severe messages are output) to 7 (all debugging messages are output). After you update the custom resource YAML, reapply the YAML for the changes to take effect.
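
As an alternative to editing the YAML file by hand, the same annotation can be set on the live custom resource with kubectl annotate. The following is a sketch under assumptions: the function name is hypothetical, the resource type fncmcluster is taken from the commands earlier in this topic, and it is assumed that the operator picks up an annotation that is set this way just as it does a reapplied YAML file.

```shell
# Set the Ansible verbosity annotation on an FNCMCluster custom resource.
# $1 = custom resource name, $2 = verbosity (0-7), $3 = namespace
set_ansible_verbosity() {
  case "$2" in
    [0-7]) ;;
    *) echo "verbosity must be an integer from 0 to 7" >&2; return 1 ;;
  esac
  kubectl annotate fncmcluster "$1" -n "$3" --overwrite \
    "ansible.sdk.operatorframework.io/verbosity=$2"
}

# Usage:
#   set_ansible_verbosity "$deployment_name" 3 "$project_name"
```

If you later reapply a YAML file that does not contain the annotation, the value might be reset, so keep the file and the live resource consistent.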

Getting the logs of the Go-based operators

To get the logs of the Content Cortex AI Services operator, run the following command:

# <Must set> Set your project name here
export project_name=$your_project_name 

# <Must set> Set target operator name here 
export operator_name=ibm-ccx-ai-services-operator

operator_pod_name=$(kubectl get pod -n "$project_name" | grep "$operator_name" | awk '{print $1}')
kubectl logs "$operator_pod_name" -n "$project_name" > "$operator_pod_name.log"

You can enable more logging by setting shared_configuration.show_sensitive_logs to true in the CR YAML file.

Getting information about pending pods

If some pods are pending, choose one of the pods, and run the following command to get more information.

kubectl describe pod <podname>
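
If several pods are pending, you might prefer to describe all of them in one pass. The following sketch uses the standard --field-selector option of kubectl get; the function name is an assumption chosen for illustration.

```shell
# Describe every pod that is in the Pending phase.
# $1 = namespace
describe_pending_pods() {
  kubectl get pods -n "$1" --field-selector=status.phase=Pending -o name |
    while read -r pod; do
      # $pod has the form "pod/<podname>", which kubectl describe accepts.
      kubectl describe -n "$1" "$pod"
    done
}

# Usage:
#   describe_pending_pods <namespace> > pending-pods.log
```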

Getting information about secrets

Kubernetes secrets are used extensively, so output about them might also be useful.

kubectl get secrets

Getting information about events

Kubernetes events are objects that provide more insight into what is happening inside a cluster, such as what decisions the scheduler makes or why some pods are evicted from a node. To get information about these events, run the following command.

kubectl get events > events.log

You can also add the verbose parameter to any kubectl command.

kubectl -v=9 get pods
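
When the event log is long, sorting it by time or restricting it to warnings can make it easier to read. The following is a minimal sketch that uses standard kubectl options; the function name and the output file names are assumptions chosen for illustration.

```shell
# Save events sorted by last timestamp, and separately only Warning events.
# $1 = namespace
collect_events() {
  kubectl get events -n "$1" --sort-by=.lastTimestamp > events-sorted.log
  kubectl get events -n "$1" --field-selector type=Warning > events-warnings.log
}

# Usage:
#   collect_events <namespace>
```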

Enabling Liberty tracing for Liberty-based pods

For Content Cortex and Navigator pods, use the following steps to enable a WebSphere® Application Server (WAS) Liberty trace specification:

Create a custom_server.xml file with a custom Liberty trace specification. A Liberty trace specification can vary and depends on why you are enabling it. The specification might come from IBM support or Liberty support.
Copy the custom_server.xml file into the target pod under the /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides directory. This directory is mapped to a PVC where the configuration file can be persisted.
WAS Liberty immediately detects the server configuration file and creates a trace.log file in the default directory, or in a custom directory if you specified one in the custom_server.xml file.
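
The steps above can be sketched as follows. The function name is hypothetical, and the trace specification "*=info" is only a placeholder for the specification that you receive from IBM support or Liberty support.

```shell
# Write a Liberty configuration override that sets a trace specification.
# $1 = trace specification string, $2 = output file
write_trace_override() {
  cat > "$2" <<EOF
<server>
    <logging traceSpecification="$1" />
</server>
EOF
}

# Usage (namespace and pod name are placeholders):
#   write_trace_override "*=info" custom_server.xml
#   kubectl cp custom_server.xml <namespace>/<podname>:/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/custom_server.xml
```

To stop tracing, remove the file from the overrides directory again.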

Resolving issues

Cannot connect to the web client when accessing Navigator

If you see an error message that states that a client cannot connect to the web client, refresh your browser to clear the message.

The message appears during the relatively short window of time in which the back-end Navigator pod is rescheduled, for example, when you update the Navigator admin desktop properties or create a new role or policy. These actions sometimes prompt connection errors, but more often they result in a message that states that the server is unavailable.

Re-creating the image pull secret

If your Docker registry secret expires, you can delete the secret and re-create it:

oc delete secret ibm-entitlement-key -n <namespace>
oc create secret docker-registry ibm-entitlement-key -n <namespace> --docker-server=image-registry.openshift-image-registry.svc:5000 --docker-username=kubeadmin --docker-password=$(oc whoami -t)

Applying changes by restarting pods

Sometimes, changes that you make in the custom resource YAML by using the operator or directly in the environment are not automatically propagated to all pods. For example, modifications to data source information or changes to Kubernetes secrets are not seen by running pods until the pods are restarted.

If changes applied by the operator or other modifications that are made in the environment do not provide the expected result, restart the pods by scaling the impacted deployments down to 0 then up to the number that you want to have. Kubernetes (OpenShift®) terminates the existing pods and creates new ones.
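
The scale-down and scale-up sequence might look like the following sketch. The function name is an assumption, and the sketch restores whatever replica count the deployment had before it was scaled down.

```shell
# Restart the pods of a deployment by scaling it to 0 and back.
# $1 = deployment name, $2 = namespace
restart_by_scaling() {
  # Remember the current desired replica count.
  replicas=$(kubectl get deployment "$1" -n "$2" -o jsonpath='{.spec.replicas}') || return 1
  kubectl scale deployment "$1" -n "$2" --replicas=0 || return 1
  # Wait until the deployment reports zero replicas (the field is empty at 0).
  while :; do
    current=$(kubectl get deployment "$1" -n "$2" -o jsonpath='{.status.replicas}')
    case "$current" in ''|0) break ;; esac
    sleep 2
  done
  kubectl scale deployment "$1" -n "$2" --replicas="$replicas"
}

# Usage:
#   restart_by_scaling <deployment_name> <namespace>
```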

CrashLoopBackOff status when an ODF storage class is used

If you install a Content Cortex instance that uses an ODF storage class, you might see some pods that fail to be ready after the cluster is rebooted.

To resolve the issue, manually restart the pods that fail to be ready.
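
To find the affected pods, you can filter the output of kubectl get pods by the STATUS column. The following is a sketch only: the function name is hypothetical, and it assumes the default column layout in which the third column is STATUS.

```shell
# Print names of pods whose STATUS column is CrashLoopBackOff.
# Reads `kubectl get pods --no-headers` output on stdin.
crashloop_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Usage: deleting a pod that a deployment manages causes it to be re-created.
#   kubectl get pods -n <namespace> --no-headers | crashloop_pods |
#     while read -r pod; do kubectl delete pod "$pod" -n <namespace>; done
```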

Directory mount failure prevents pod readiness

If a pod stays in a CreateContainerError state and the description of the problem includes text similar to the following message, remove the failing mounted path.

Warning  Failed  43m  kubelet  Error: container create failed: time="2021-03-03T07:26:47Z" level=warning msg="unable to terminate initProcess" error="exit status 1"
time="2021-03-03T07:26:47Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: rootfs_linux.go:60: mounting \"/var/lib/kubelet/pods/473b091d-acff-437b-b568-2383604dac01/volume-subpaths/config-volume/fncmdeploy-cmis-deploy/3\" to rootfs at \"/var/lib/containers/storage/overlay/d011608f6df4bbfcc26c7d60568915caf7932124e61924b1a75802e6884ea060/merged/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml\" caused: not a directory"

The problem occurs when a folder is generated instead of an XML file. An empty folder is created at the location where the file is to be mounted to the deployment, which raises the error.

You can remove a problematic folder from a deployment in two ways:

If you can access the persistent volume, go to the mounted path and delete it. You can get the path to the folder by running the following command.
```
kubectl describe pv $pv_name
```
If you cannot access the persistent volume, edit the deployment by removing the failed mount.
1. Edit the deployment by running the oc edit deployment <deployment_name> command, and remove the failing mount. The following lines show an example mountPath:
```
- mountPath: /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
  name: config-volume
  subPath: ibm_oidc_rp.xml
```
2. When the pod is in the Running state, access it by using the kubectl exec -it command.
```
kubectl exec -it fncmdeploy-cmis-deploy-5cd4774f78-mg6pw -- bash
```
3. Delete the folder that was created in place of the file with the rm -r command.
```
rm -r /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
```

When the folder is removed, you can wait for the operator to reconcile the change or add the removed mount path back manually to fix it.

Operator pods get evicted due to the lack of ephemeral storage

Some of the operator pods might get evicted when the predefined ephemeral storage is full. Possible causes of this problem include, but are not limited to:

Debug logging is turned on.
The retry logic of the operator is waiting for some resources to become available.

Typically, the pod might show the following error when you get the description of the pod:

Status:       Failed
Reason:       Evicted
Message:      Pod ephemeral local storage usage exceeds the total limit of containers 500Mi.

Follow these steps to resolve the problem:

Determine the CSV that the operator belongs to. Typically, you can predict the CSV name by looking at the operator's pod name. For example, if the operator's pod name is ibm-content-operator-59797bd587-clqv7, then you can find its CSV with the following command:
```
CSV=$(oc get csv |grep ibm-content-operator | awk '{print $1}') 
echo $CSV
```

Update the ephemeral storage to a specific value. In the following example, the new ephemeral storage size is 800Mi:

oc patch csv "$CSV" --type="json" -p='[{"op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/ephemeral-storage", "value": "800Mi"}]'

After you update the value, the operator pod restarts.
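
To confirm that the new limit is in place, you can read the same field back from the CSV. The following sketch mirrors the path that is used in the patch command; the function name is an assumption chosen for illustration.

```shell
# Print the ephemeral-storage limit of the first container in the first
# deployment of a CSV.
# $1 = CSV name, $2 = namespace
csv_ephemeral_limit() {
  oc get csv "$1" -n "$2" \
    -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].resources.limits.ephemeral-storage}'
}

# Usage:
#   csv_ephemeral_limit "$CSV" <namespace>
```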

Operator cannot connect to the OCP API endpoint

If you see the following error on your cluster, set the namespaces parameter for the API server (sc_api_namespace) in the sc_egress_configuration section of the custom resource to "{}".

"stderr": "Error from server (InternalError): an error on the server 
(\"dial tcp XXX.XX.X.X:443: i/o timeout\") has prevented the request from succeeding", 
"stderr_lines": ["Error from server (InternalError): an error on the server 
(\"dial tcp XXX.XX.X.X:443: i/o timeout\") has prevented the request from succeeding"]

For more information, see Shared configuration parameters.

Additional AI Services troubleshooting

For additional troubleshooting information about AI Services features, see the following topics: