Troubleshooting operator issues

Use the following topics to resolve common issues with the operator-based deployment.

Troubleshooting issues in a Kubernetes environment can be challenging because of the complexity of interconnected components. It often involves reviewing the following areas to pinpoint the root cause of the problem.

Recent changes
Check what has recently changed in your Kubernetes setup, including the cluster itself, the pods, or the nodes.  
Operator Lifecycle Manager (OLM) status
Evaluate the status conditions of the IBM Product Master operator. The status reported by OLM indicates whether the operator installed successfully, is encountering issues, or is awaiting updates.
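When the operator is installed through OLM, you can inspect its ClusterServiceVersion, for example as follows; the namespace and resource names are placeholders that depend on your installation.
# List the ClusterServiceVersions in the operator namespace; the PHASE
# column shows Succeeded, Installing, or Failed.
kubectl get csv -n <operator_namespace>
# Show the detailed status conditions of a specific ClusterServiceVersion.
kubectl describe csv <product_master_csv_name> -n <operator_namespace>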

Pods keep restarting

Causes
When a Kubernetes pod fails to reach the 1/1 Ready status, the first step is to debug the pod. The following are some methods to debug the pod and locate the cause; a combined command sketch follows this list.
  1. Check the pod logs by using the following command.
    kubectl logs <pod_name>
  2. View the pod events by using the following command.
    kubectl describe pod <pod_name>
  3. Check the exit code of the terminated container; it is shown in the Last State section of the kubectl describe output.
  4. If a pod is failing due to resource limitations, check the resource requests and limits for the pod's containers, and adjust the limit specification for the service in the ipm_12.0.x_cr.yaml file.
  5. Pods can continuously restart if either the readiness or liveness probe fails. This failure is often caused by Db2® connection failures or HTTP endpoint failures after the 900-second timeout.
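The following sequence is a minimal debugging sketch that combines these methods; the pod name is a placeholder.
# Show the events, probe failures, and the exit code of the last
# terminated container (under Last State).
kubectl describe pod <pod_name>
# Show the logs of the current container.
kubectl logs <pod_name>
# Show the logs of the previous, terminated container, which often
# contains the actual failure for a restarting pod.
kubectl logs <pod_name> --previous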
Solution
Check all the available Kubernetes or OpenShift® pods to determine whether the problem affects a global component or is specific to the Product Master deployed pods.
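For example, you can compare pod status across namespaces; this is a generic check, not specific to Product Master.
# List pods in all namespaces and look for non-Running or restarting pods.
kubectl get pods --all-namespaces
# List only the pods in the Product Master namespace.
kubectl get pods -n <product_master_namespace>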

Pods are in the Pending state

Causes
Pods can go into the Pending state if they cannot be scheduled onto a node. The Product Master pods have three toleration types within Kubernetes (illustrated after this list). The pods are not created if:
  • Kubernetes worker nodes are not ready.
  • A node is unreachable by the Kubernetes node controller within 300 seconds.
  • Worker nodes are tainted with memory pressure.
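The following tolerations are a sketch of what the three cases above typically map to in Kubernetes; the exact values in the Product Master pod specifications can differ.
tolerations:
# Pod is evicted 300 seconds after the node becomes not ready.
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
# Pod is evicted 300 seconds after the node becomes unreachable.
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
# Pod is not scheduled onto nodes that report memory pressure.
- key: node.kubernetes.io/memory-pressure
  operator: Exists
  effect: NoSchedule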
Solution
If the taint on the worker node suggests that you exhausted the supply of CPU or memory in your cluster, update the cluster sizing.
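To confirm the scheduling failure and the taints involved, you can run checks such as the following.
# Show why the pods cannot be scheduled (for example, Insufficient cpu
# or Insufficient memory).
kubectl get events --field-selector reason=FailedScheduling
# Show the taints and the allocated resources of a worker node.
kubectl describe node <node_name>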

Pods are running, but the application URLs are not loading

Causes
During the Kubernetes Product Master operator reconciliation, the controller manager creates a Product Master ingress and routing rule against two services (adminui and personaui) for the Product Master application URLs. When a route is created, the built-in load balancer picks the route to expose the requested Product Master service. If the route fails during the operator reconciliation, the controller manager pod logs highlight the error. On Red Hat OpenShift only, the network policy group of the namespace needs to be designated as the ingress.
Solution
Change the StorageClass type in an existing Product Master deployment as follows; an example command sequence follows these steps.
  1. In the ipm_12.0.x_cr.yaml file, set the replica count of the pod to 0 and apply the file again.
  2. Delete the existing PersistentVolumeClaim that is used by the pod.
  3. Update the StorageClass in the ipm_12.0.x_cr.yaml file and apply the file again.
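The following commands sketch these steps; the PersistentVolumeClaim name is a placeholder that you can look up with kubectl get pvc.
# Step 1 and step 3: apply the updated custom resource.
kubectl apply -f ipm_12.0.x_cr.yaml
# Step 2: delete the PersistentVolumeClaim of the scaled-down pod.
kubectl delete pvc <pvc_name> -n <product_master_namespace>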

Error while creating pods

Symptoms
Error creating: pods "productmaster-elasticsearch-fxxxxxx-" is forbidden: unable to validate against any security context constraint: [spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
Solution
In the elasticsearch section of the ipm_12.0.x_cr.yaml file, update the value of the privileged property to false and apply again.
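A sketch of the relevant fragment of the custom resource follows; the surrounding structure of the ipm_12.0.x_cr.yaml file is assumed and can differ in your version.
elasticsearch:
  # Must be false on clusters whose security context constraints
  # forbid privileged containers.
  privileged: false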

Database connection errors

Solution
Before you start deployment, ensure that the following connections are open.
  • Red Hat OpenShift or Kubernetes platforms and the database server
  • Red Hat OpenShift or Kubernetes platforms and the Bluemix registry (registry.ng.bluemix.net/product_master)
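A quick way to verify connectivity is to test the endpoints from inside the cluster; the host names are placeholders, and this sketch assumes that a temporary busybox test pod can be started.
# Test the database server port from a temporary pod.
kubectl run conn-test --rm -it --image=busybox --restart=Never -- \
  nc -zv <db_host> <db_port>
# Test the registry endpoint the same way.
kubectl run registry-test --rm -it --image=busybox --restart=Never -- \
  nc -zv registry.ng.bluemix.net 443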

Failing IBM MQ pod

Symptoms
The IBM MQ pod fails with the following error.
Creating queue manager: Permission denied attempting to access an INI file.
Solution
If you are using NFS file storage for the IBM MQ pod, then in the mq section of the ipm_12.0.x_cr.yaml file, change the value of the storage property to block and apply the file again.
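A sketch of the relevant fragment follows; the exact structure of the mq section in the ipm_12.0.x_cr.yaml file is assumed.
mq:
  # Block storage avoids the INI file permission errors that occur
  # with NFS file storage.
  storage: block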

No route to host (Host unreachable) error

Symptoms
The ipm.log file of the Admin UI pod contains the following message.
Error opening socket to the server (dbserver.somedomain.com/xx.xx.xx.xx) on port 52,332
No route to host (Host unreachable)
Causes
The error indicates a database connection issue.
Solution
Verify whether the database connection can be established in your environment. You can run the following commands to test the database connection.
# Open a shell inside the Admin UI pod.
kubectl exec -it <pod name> -- /bin/bash
# Load the Product Master environment.
source /home/default/.bash_profile
# Run the database connection test script.
cd $TOP/bin/
./test_db.sh

Deployment of multiple Product Master pods fails

Symptoms
When you try to deploy multiple instances of the Product Master pods, the deployment fails.
Causes
The deployment fails because the exposed ports are already occupied by the first instance of the deployment.
Solution
In the ipm_12.0.x_cr.yaml file, update all the ext_port property values to unique ports, and apply the file again. This avoids conflicts with the existing Product Master deployment.
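For illustration, a fragment such as the following would give the second instance its own external ports; the service names, port numbers, and exact CR structure here are assumptions.
adminui:
  ext_port: 32501   # the first instance uses, for example, 32500
personaui:
  ext_port: 32601   # the first instance uses, for example, 32600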

Admin UI pod shows error after deployment

Symptoms
In some OpenShift environments, the Admin UI pod displays an error after deployment.
Solution
Run the following command on the OpenShift environment and refresh the page.
oc get --namespace openshift-ingress-operator ingresscontrollers/default --output jsonpath='{.status.endpointPublishingStrategy.type}'
If the output value is HostNetwork, then run the following command.
oc label namespace default 'network.openshift.io/policy-group=ingress'

Hazelcast service error

Symptoms
Though the Hazelcast service is running, the Scheduler pod is unable to connect and fails with the following error.
java.lang.Exception: Hazelcast instance found to be null. Possible reason is unable to connect to any address.
Causes
Network policy restrictions on the Hazelcast service block connections from the Scheduler service.
Solution
To allow the Scheduler service to connect, apply the following hz-sch-networkpolicy.yaml file to each deployment.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: np-hazelcast2
  namespace: <>
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: productmaster-sch
    ports:
    - port: 5702
      protocol: TCP
  podSelector:
    matchLabels:
      app: productmaster-hazelcast
  policyTypes:
  - Ingress
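After you save the file, apply it to the namespace of the deployment.
kubectl apply -f hz-sch-networkpolicy.yaml -n <namespace>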

MongoDB pod-related

Symptoms
The MongoDB pod fails to run with either of the following errors.
Another mongod instance is already running on the /data/db directory, terminating
No space left on device
Solution
Change the storage class from IBM Cloud File Storage (ibmc-file-gold-gid) to IBM Cloud Block Storage (ibmc-block-gold) in the Persistent Volume Claim for MongoDB on the IBM Cloud Public (ROKS) cluster.
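Because the storage class of an existing PersistentVolumeClaim cannot be changed in place, this typically means re-creating the claim; the following sketch assumes that the MongoDB data can be re-created or restored, and the PVC name is a placeholder.
# Check the current storage class of the MongoDB PersistentVolumeClaim.
kubectl get pvc <mongodb_pvc_name> -o jsonpath='{.spec.storageClassName}'
# Delete the claim, then re-create it (or reapply the custom resource)
# with storageClassName: ibmc-block-gold.
kubectl delete pvc <mongodb_pvc_name>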