Troubleshooting
Collect specific data about your environment and your installation before you contact IBM® support for assistance with an issue. Always provide a detailed description of the problem and your environment.
When you run diagnostic commands, run them from an empty directory to package the files more cleanly. Run the commands from the namespace in which you observe the problematic container or component.
The ibm-content-operator locates the Content Cortex base images and has Ansible® roles to handle the reconciliation logic and declare a set of playbook tasks for each component. The roles declare all the variables and defaults for how the role is executed.
Use the following sections to find the information that you are looking for.
Must Gather
You can use the must_gather_py script to gather information such as configmaps, pods, secrets, and log files about the Content Cortex resources in the target namespace.
The must_gather_py script uses the following utility tools and needs them to be installed on your client machine.
- Kubernetes CLI
- Python
- ibm-content-cortex-containers repository from GitHub
If the script finds that any of these tools are missing on the client, it reports which tools are missing and provides a choice to install the tool.
To run the script and to know more about the information the script collects, see technote Gathering deployment information and logs from Content Cortex.
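Before you run the script, you can check the client machine yourself. The following is a minimal sketch, not part of the must_gather_py script; the function name find_missing_tools and the binary names kubectl and python3 are assumptions for illustration.

```shell
# Print the names of the given tools that are not found on PATH,
# separated by spaces. Prints an empty line when nothing is missing.
find_missing_tools() {
  local tool missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="${missing}${tool} "
    fi
  done
  # Strip the trailing space before printing.
  printf '%s\n' "${missing% }"
}

# Example usage (binary names are assumptions):
#   missing=$(find_missing_tools kubectl python3)
#   [ -z "$missing" ] || echo "Missing tools: $missing"
```

The function only reports what is missing; it does not install anything.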
Collecting data
- Getting the logs of the Ansible-based operators
- The Content operator is an Ansible-based operator. To get the log of the latest reconciliation for Ansible-based operators, run the following command:
    # <Must set> Set your project name here
    export project_name=$your_project_name
    # <Must set> Set target operator name here
    export operator_name=ibm-content-operator
    operator_pod_name=$(kubectl get pod | grep $operator_name | awk '{print $1}')
    kubectl exec -i $operator_pod_name -n $project_name -- /bin/bash -c 'cat /tmp/ansible-operator/runner/fncm.ibm.com/v1/*/*/*/artifacts/latest/stdout' > operator-ansible.log
Optional: Export the history of the Ansible logs.
Ansible operators keep a backup of the logs under /logs/$operator_pod_name/ansible-operator/runner/<group>/<version>/<kind>/<namespace>/<name>/artifacts. The log contains information on the first 10 reconciles, including the latest reconcile. The following commands copy the logs to a local directory. Select the operator name for which you want to export the log.
    # <Must set> Set your project name here
    export project_name=$your_project_name
    # <Must set> Set target operator name here
    export operator_name=ibm-content-operator
    export deployment_name=$(kubectl get fncmcluster | awk '{print $1}' | grep -v "NAME")
    # The following commands export the Content operator's Ansible log to /tmp/$operator_pod_name-log.
    # You do not need this when you install from the Content operator.
    export operator_pod_name=$(kubectl get pod | grep ibm-content-operator | awk '{print $1}')
    kubectl cp $project_name/$operator_pod_name:/logs/$operator_pod_name/ansible-operator/runner/fncm.ibm.com/v1/FNCMCluster/$project_name/$deployment_name/artifacts /tmp/$operator_pod_name-log
Note: If you see "Cannot stat: No such file or directory" when you export the Ansible logs, either no log has been generated by the current operator or the current operator is in its first reconcile.
Optional: Edit the verbosity of the Ansible logs.
If the operator log does not provide the level of detail that you need, you can gather more details by adding an annotation like the following example to your custom resource YAML:
    metadata:
      ...
      annotations:
        ansible.sdk.operatorframework.io/verbosity: "3"
    spec:
For the verbosity value, the normal rules for Ansible verbosity apply, where higher values mean more output. Acceptable values range from 0 (only the most severe messages are output) to 7 (all debugging messages are output). After you update the custom resource YAML, reapply the YAML for the changes to take effect.
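As an alternative to editing the YAML by hand, the same annotation can probably be set from the command line. The following is a sketch under assumptions: the function name set_operator_verbosity is hypothetical, the custom resource kind is taken to be fncmcluster as in the earlier commands, and the namespace is whatever your current context points at.

```shell
# Set the Ansible verbosity annotation on a custom resource.
# Arguments: <custom_resource_name> [verbosity 0-7, default 3]
set_operator_verbosity() {
  local cr_name="$1"
  local level="${2:-3}"
  # Reject values outside the documented range of 0 to 7.
  case "$level" in
    [0-7]) ;;
    *) echo "verbosity must be a value from 0 to 7" >&2; return 1 ;;
  esac
  kubectl annotate fncmcluster "$cr_name" \
    "ansible.sdk.operatorframework.io/verbosity=${level}" --overwrite
}

# Example usage (the resource name is hypothetical):
#   set_operator_verbosity my-content-deployment 5
```

The --overwrite flag lets you change the value again later without removing the annotation first.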
- Getting the logs of the Go-based operators
- To get the logs of the Content Cortex AI Services operator, run the following command:
    # <Must set> Set your project name here
    export project_name=$your_project_name
    # <Must set> Set target operator name here
    export operator_name=ibm-ccx-ai-services-operator
    operator_pod_name=$(kubectl get pod | grep $operator_name | awk '{print $1}')
    kubectl logs $operator_pod_name -n $project_name > $operator_pod_name.log
You can enable more logging by setting shared_configuration.show_sensitive_logs to true in the CR YAML file.
- Getting information about pending pods
- If some pods are pending, choose one of the pods, and run the following command to get more information.
    kubectl describe pod <podname>
- Getting information about secrets
- Kubernetes secrets are used extensively, so output about them might also be useful.
    kubectl get secrets
- Getting information about events
- Kubernetes events are objects that provide more insight into what is happening inside a cluster, such as what decisions the scheduler makes or why some pods are evicted from a node. To get information about these events, run the following command.
    kubectl get events > events.log
You can also add the verbose parameter to any kubectl command.
    kubectl -v=9 get pods
- Enabling Liberty tracing for Liberty-based pods
- For Content Cortex and Navigator pods, use the following steps to enable a WebSphere® Application Server (WAS) Liberty logging trace specification:
- Create a custom_server.xml file with a custom Liberty trace specification. A Liberty trace specification can vary and depends on why you are enabling it. The specification might come from IBM support or Liberty support.
- Copy the custom_server.xml file into the target pod under the /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides directory. This directory is mapped to a PVC where the configuration file can be persisted.
WAS Liberty immediately detects the server configuration file and creates a trace.log file in the default directory, or in a custom directory if you specified one in the custom_server.xml file.
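The two steps above can be sketched as follows. This is an illustration only: the function names are hypothetical, the default trace specification (*=info) is a placeholder to be replaced with the specification that IBM support or Liberty support gives you, and the copy assumes that kubectl cp works against the target container.

```shell
# Write a Liberty configuration override that sets a trace specification.
# Arguments: [output_file] [trace_specification]
write_trace_config() {
  local out_file="${1:-custom_server.xml}"
  # Placeholder default; use the specification you were given.
  local trace_spec="${2:-*=info}"
  cat > "$out_file" <<EOF
<server>
  <logging traceSpecification="${trace_spec}" />
</server>
EOF
}

# Copy the override file into the configDropins/overrides directory of a pod.
# Arguments: <pod_name> [source_file]
copy_trace_config() {
  local pod_name="$1"
  local src_file="${2:-custom_server.xml}"
  kubectl cp "$src_file" \
    "${pod_name}:/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/custom_server.xml"
}

# Example usage (pod name and trace specification are hypothetical):
#   write_trace_config custom_server.xml '*=info:com.ibm.ws.security.*=all'
#   copy_trace_config my-navigator-pod custom_server.xml
```

To stop tracing, remove the file from the overrides directory or set the trace specification back to a less verbose value.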
Resolving issues
- Cannot connect to the web client when accessing Navigator
- If you see an error message that states that a client cannot connect to the web client, refresh your browser and the connection message goes away.
The "cannot connect" message appears during the relatively short window of time when the back-end Navigator pod is rescheduled, for example, when you update the Navigator admin desktop properties or create a new role or policy. Sometimes these actions prompt connection errors, but usually a message states that the server is unavailable.
- Re-creating the image pull secret
- If your Docker registry secret expires, you can delete the secret and re-create it:
    oc delete secret ibm-entitlement-key -n <namespace>
    oc create secret docker-registry ibm-entitlement-key --docker-server=image-registry.openshift-image-registry.svc:5000 --docker-username=kubeadmin --docker-password=$(oc whoami -t)
- Applying changes by restarting pods
- Sometimes, changes that you make in the custom resource YAML by using the operator or directly in the environment are not automatically propagated to all pods. For example, modifications to data source information or changes to Kubernetes secrets are not seen by running pods until the pods are restarted.
If changes applied by the operator or other modifications that are made in the environment do not provide the expected result, restart the pods by scaling the impacted deployments down to 0 then up to the number that you want to have. Kubernetes (OpenShift®) terminates the existing pods and creates new ones.
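The scale-down and scale-up sequence might look like the following sketch. The function name restart_by_scaling is hypothetical, and the deployment name and replica count are values that you supply; the commands act on the namespace of your current context.

```shell
# Restart the pods of a deployment by scaling it to 0 and back up.
# Arguments: <deployment_name> <replica_count>
restart_by_scaling() {
  local deployment="$1"
  local replicas="$2"
  # Terminate the existing pods.
  kubectl scale deployment "$deployment" --replicas=0
  # Create new pods.
  kubectl scale deployment "$deployment" --replicas="$replicas"
  # Wait until the new pods are rolled out.
  kubectl rollout status "deployment/${deployment}"
}

# Example usage (deployment name is hypothetical):
#   restart_by_scaling my-navigator-deploy 2
```

To look up the current replica count before you scale down, `kubectl get deployment <deployment_name>` shows it in the READY column.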
- CrashLoopBackOff status when an ODF storage class is used
- If you install a Content Cortex instance that uses an ODF storage class, you might see some pods that fail to be ready after the cluster is rebooted.
To resolve the issue, manually restart the pods that fail to be ready.
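One way to restart such pods is to delete them so that their controller re-creates them. The following sketch assumes that the affected pods are managed by a deployment or similar controller and that they show CrashLoopBackOff in the STATUS column; the function names are hypothetical.

```shell
# Print the names of pods in the current namespace whose STATUS column
# matches the given value.
pods_with_status() {
  local status="$1"
  kubectl get pods --no-headers | awk -v s="$status" '$3 == s {print $1}'
}

# Delete every pod that is in CrashLoopBackOff so that it is re-created.
restart_failing_pods() {
  local pod
  for pod in $(pods_with_status CrashLoopBackOff); do
    kubectl delete pod "$pod"
  done
}

# Example usage:
#   pods_with_status CrashLoopBackOff   # review the list first
#   restart_failing_pods
```

Reviewing the list before you delete anything avoids restarting pods that you did not intend to touch.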
- Directory mount failure prevents pod readiness
- If a pod stays in a CreateContainerError state, and the description of the problem includes text similar to the following message, then remove the failing mounted path.
    Warning Failed 43m kubelet Error: container create failed: time="2021-03-03T07:26:47Z" level=warning msg="unable to terminate initProcess" error="exit status 1"
    time="2021-03-03T07:26:47Z" level=error msg="container_linux.go:366: starting container process caused: process_linux.go:472: container init caused: rootfs_linux.go:60: mounting \"/var/lib/kubelet/pods/473b091d-acff-437b-b568-2383604dac01/volume-subpaths/config-volume/fncmdeploy-cmis-deploy/3\" to rootfs at \"/var/lib/containers/storage/overlay/d011608f6df4bbfcc26c7d60568915caf7932124e61924b1a75802e6884ea060/merged/opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml\" caused: not a directory"
The problem occurs when a folder is generated instead of an XML file. An empty folder is created to mount the file to the deployment, and this raises the error.
You can remove a problematic folder from a deployment in two ways:
- If you can access the persistent volume, go to the mounted path and delete it. You can get the path to the folder by running the following command.
    kubectl describe pv $pv_name
- If you cannot access the persistent volume, edit the deployment by removing the failed mount.
  - Edit the deployment by running the oc edit deployment <deployment_name> command. The following lines show an example mountPath:
        - mountPath: /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
          name: config-volume
          subPath: ibm_oidc_rp.xml
  - You can then access the pod when it is Running by using the kubectl exec -it command.
        kubectl exec -it fncmdeploy-cmis-deploy-5cd4774f78-mg6pw -- bash
  - Delete the problematic folder with the rm command. Because the path is a folder rather than a file, use the -r option.
        rm -r /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/ibm_oidc_rp.xml
When the folder is removed, you can wait for the operator to reconcile the change or add the removed mount path back manually to fix it.
- Operator pods get evicted due to the lack of ephemeral storage
- Some of the operator pods might get evicted when the pre-defined ephemeral storage is full. The causes of this problem might include, but are not limited to:
- Debug logging is turned on.
- Retrying logic of the operator is waiting for some resources to be available.
Typically, the pod might show the following error when you get the description of the pod:
    Status: Failed
    Reason: Evicted
    Message: Pod ephemeral local storage usage exceeds the total limit of containers 500Mi.
Follow these steps to resolve the problem:
- Determine the CSV that the operator belongs to. Typically, you can predict the CSV name by looking at the operator's pod name. For example, if the operator's pod name is ibm-content-operator-59797bd587-clqv7, then you can find its CSV with the following command:
    CSV=$(oc get csv | grep ibm-content-operator | awk '{print $1}')
    echo $CSV
- Update the ephemeral storage to a specific value. In the following example, the new ephemeral storage size is 800Mi:
    oc patch csv "$CSV" --type="json" -p='[{"op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/ephemeral-storage", "value": "800Mi"}]'
After updating the value, the operator pod is restarted.
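To confirm that the new limit is in place, you can read the value back from the CSV. The following is a sketch; the function name show_ephemeral_limit is hypothetical, and the JSONPath expression assumes the same deployment and container indexes (0 and 0) that the patch command uses.

```shell
# Print the ephemeral storage limit of the first container of the first
# deployment that is defined in a CSV.
# Arguments: <csv_name>
show_ephemeral_limit() {
  local csv_name="$1"
  oc get csv "$csv_name" \
    -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].resources.limits.ephemeral-storage}'
}

# Example usage, with CSV set as in the earlier step:
#   show_ephemeral_limit "$CSV"
```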
- Operator cannot connect to the OCP API endpoint
- If you see the following error on your cluster, then change the value of the namespaces for the API Server parameters (sc_api_namespace) in the sc_egress_configuration section of the custom resource to "{}".
    "stderr": "Error from server (InternalError): an error on the server (\"dial tcp XXX.XX.X.X:443: i/o timeout\") has prevented the request from succeeding",
    "stderr_lines": ["Error from server (InternalError): an error on the server (\"dial tcp XXX.XX.X.X:443: i/o timeout\") has prevented the request from succeeding"]
For more information, see Shared configuration parameters.
- Additional AI Services troubleshooting
- For additional troubleshooting information about AI Services features, see the following topics: