Troubleshooting operator issues
Use the following topics to resolve common issues with the operator-based deployment.
Troubleshooting issues in a Kubernetes environment can be challenging due to the complexity of interconnected components. It often involves reviewing the following areas to pinpoint the root cause of the problem.
- Recent changes
- Check what has recently changed in your Kubernetes setup, including the cluster itself, the pods, or the nodes.
- Operator Lifecycle Manager (OLM) status
- Evaluate the status of the IBM Product Master operator condition. By examining this status, you can determine whether the operator is installed successfully, is encountering any issues, or is awaiting updates.
Pods keep restarting
- Causes
- When a Kubernetes pod fails to reach the 1/1 (Ready) status, the first step is to debug the pod. The following are some methods to debug the pod and locate the cause.
- Check the pod logs by using the following command.
kubectl logs <pod_name>
- View the pod events by using the following command.
kubectl describe pod <pod_name>
- Check the exit code of the pod termination by using the logs.
- If a pod is failing due to resource limitations, check the resource requests and limits for the pod's containers and adjust the limit specification for the service in the ipm_12.0.x_cr.yaml file (see the example after this list).
- Pods might restart continuously if either the readiness or liveness probe fails. This is often caused by Db2® connection or HTTP endpoint failures after 900 seconds.
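For example, a minimal sketch of a resource specification for one service in the ipm_12.0.x_cr.yaml file might look like the following. The service name (adminui) and the exact field layout are illustrative assumptions; match them against the structure of your own file, because the operator maps these values onto the pod containers.
spec:
  adminui:                # illustrative service section; edit the section for the failing pod
    resources:
      requests:
        cpu: "1"          # CPU reserved for the container at scheduling time
        memory: 2Gi       # memory reserved for the container at scheduling time
      limits:
        cpu: "2"          # hard CPU ceiling; the container is throttled above this value
        memory: 4Gi       # hard memory ceiling; the container is restarted if it exceeds this value
After you update the file, apply it again so that the operator reconciles the change.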
- Solution
- Check all the available Kubernetes or OpenShift® pods to determine whether the issue affects a global component or is specific to the Product Master deployed pods.
Pods are in pending state
- Causes
- Pods can go into the pending state if they cannot be scheduled onto a node. The Product Master pods have three toleration types within Kubernetes, illustrated after this list. The pods are not created if:
- Kubernetes worker nodes are not ready,
- The node is unreachable by the Kubernetes node controller for more than 300 seconds,
- Worker nodes have been tainted with memory pressure.
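These conditions correspond to the standard Kubernetes node taints. As an illustration only (the exact tolerations are set by the operator and can differ), tolerations for the not-ready, unreachable, and memory-pressure taints look similar to the following in a pod specification.
tolerations:
- key: node.kubernetes.io/not-ready          # the node reports NotReady
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300                     # the pod is evicted after 300 seconds
- key: node.kubernetes.io/unreachable        # the node controller cannot reach the node
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/memory-pressure    # the node is under memory pressure
  operator: Exists
  effect: NoSchedule                         # new pods are not scheduled onto the tainted node
Comparing the taints on the worker nodes with these tolerations helps to explain why a pod stays in the pending state.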
- Solution
- If the taint from the worker node suggests that you have exhausted the supply of CPU or memory in your cluster, update cluster sizing.
Pods are running, but the application URLs are not loading
- Causes
- During the Product Master operator reconciliation, the controller manager creates a Product Master ingress and routing rule against two services (adminui and personaui) for the Product Master application URLs. When a route is created, the built-in load balancer picks up the route to expose the requested Product Master service. If the route fails during the operator reconciliation, the controller manager pod logs highlight the error. On Red Hat OpenShift only, the network policy group needs to be designated as the ingress, as illustrated in the sketch after this paragraph.
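As an illustration, a network policy that admits traffic from the OpenShift ingress policy group into the Product Master namespace is similar to the following sketch. The policy name is arbitrary, and the label selector assumes the default OpenShift ingress configuration; adjust both to your cluster.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress          # arbitrary policy name
spec:
  podSelector: {}                             # applies to all pods in the namespace
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          network.openshift.io/policy-group: ingress   # namespaces in the OpenShift ingress policy group
  policyTypes:
  - Ingress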
- Solution
- To change the StorageClass type in an existing Product Master deployment, complete the following steps; a configuration sketch follows the steps.
- Update the ipm_12.0.x_cr.yaml file to set the replica count of the pod to 0, and apply again.
- Delete the existing PersistentVolumeClaim used by the pod.
- Update the StorageClass in the ipm_12.0.x_cr.yaml file, and apply again.
- Update the ipm_12.0.x_cr.yaml file to restore the replica count of the pod to its previous value, and apply again.
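As a sketch only, the fragment of the ipm_12.0.x_cr.yaml file that these steps modify might resemble the following for a persistent service such as MongoDB. The section name, the field names, and the storage class value are illustrative; use the names that appear in your own file.
spec:
  mongodb:                          # illustrative section for the pod whose storage is being changed
    replicas: 0                     # scale the pod down before deleting its PersistentVolumeClaim
    storageClass: ibmc-block-gold   # the new StorageClass for the re-created volume
After the PersistentVolumeClaim is deleted and the StorageClass is updated, restore the replica count and apply the file again.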
Error while creating pods
- Symptoms
Error creating: pods "productmaster-elasticsearch-fxxxxxx-" is forbidden: unable to validate against any security context constraint: [spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
- Solution
- In the elasticsearch section of the ipm_12.0.x_cr.yaml file, update the value of the privileged property to false and apply again.
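For example, after the change the elasticsearch section contains a value similar to the following; whether the section sits directly under spec depends on the layout of your file.
spec:
  elasticsearch:
    privileged: false   # do not request a privileged security context for the Elasticsearch container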
Database connection errors
- Solution
- Before you start deployment, ensure that the following connections are open.
- Red Hat OpenShift or Kubernetes platforms and the database server
- Red Hat OpenShift or Kubernetes platforms and the Bluemix registry (registry.ng.bluemix.net/product_master)
Failing IBM MQ pod
- Symptoms
Creating queue manager: Permission denied attempting to access an INI file.
- Solution
- If you are using NFS file storage for the IBM MQ pod, then in the mq section of the ipm_12.0.x_cr.yaml file, change the value of the storage property to block and apply again.
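For example, after the change the mq section contains a value similar to the following; the surrounding structure is illustrative.
spec:
  mq:
    storage: block   # use block storage instead of NFS file storage for the queue manager data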
No route to host (Host unreachable) error
- Symptoms
- Error opening socket to the server (dbserver.somedomain.com/xx.xx.xx.xx) on port 52,332, with the following message in the ipm.log file of the Admin UI pod.
No route to host (Host unreachable)
- Causes
- The error indicates a database connection issue.
- Solution
- Verify whether the database connection can be established from your environment. You can run the following commands to test the database connection.
kubectl exec -it <pod name> -- /bin/bash
source /home/default/.bash_profile
cd $TOP/bin/
./test_db.sh
Deployment of multiple Product Master pods fails
- Symptoms
- When you try to deploy multiple instances of the Product Master pods, the deployment fails.
- Causes
- The deployment fails because the exposed ports are already occupied by the first instance of the deployment.
- Solution
- In the ipm_12.0.x_cr.yaml file, update all the ports of the ext_port property to unique values, and apply again. This avoids conflicts with the existing Product Master deployment.
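For example, the second deployment might use values similar to the following; the service sections and port numbers are placeholders, and only the ext_port property name comes from the file.
spec:
  adminui:
    ext_port: 30101   # placeholder; must not collide with the ports of the first deployment
  personaui:
    ext_port: 30102   # placeholder
  mq:
    ext_port: 30103   # placeholder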
Admin UI pod shows error after deployment
- Symptoms
- In some OpenShift environments, the Admin UI pod displays an error after deployment.
- Solution
- Run the following command on the OpenShift environment and refresh the page.
oc get --namespace openshift-ingress-operator ingresscontrollers/default --output yaml
Hazelcast service error
- Symptoms
- Though the Hazelcast service is running, the Scheduler pod is unable to connect to it.
- Causes
- The Hazelcast service is blocking the Scheduler service.
- Solution
- To open the Scheduler service, apply the hz-sch-networkpolicy.yaml file to each deployment.
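The contents of the hz-sch-networkpolicy.yaml file are not reproduced here. As an illustration of the idea only, a network policy that allows the Scheduler pods to reach the Hazelcast service on its default member port is similar to the following sketch; the pod labels and the port are assumptions, so rely on the actual hz-sch-networkpolicy.yaml file for your deployment.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: hz-sch-networkpolicy             # name taken from the referenced file name
spec:
  podSelector:
    matchLabels:
      app: productmaster-hazelcast       # placeholder label for the Hazelcast pods
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: productmaster-scheduler   # placeholder label for the Scheduler pods
    ports:
    - protocol: TCP
      port: 5701                         # default Hazelcast member port
  policyTypes:
  - Ingress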
MongoDB pod-related issues
- Symptoms
- The MongoDB pod fails to run with either of the following errors.
Another mongod instance is already running on the /data/db directory, terminating
No space left on device
- Solution
- Change the storage class from IBM Cloud File Storage (ibmc-file-gold-gid) to IBM Cloud Block Storage (ibmc-block-gold) in the Persistent Volume Claim for MongoDB on the IBM Cloud Public (ROKS) cluster.
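As an example, a PersistentVolumeClaim for MongoDB that requests block storage on an IBM Cloud Public (ROKS) cluster is similar to the following; the claim name and the requested size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: productmaster-mongodb-pvc     # placeholder claim name
spec:
  accessModes:
  - ReadWriteOnce                     # block storage volumes are mounted by a single node at a time
  storageClassName: ibmc-block-gold   # IBM Cloud Block Storage class instead of ibmc-file-gold-gid
  resources:
    requests:
      storage: 20Gi                   # placeholder size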