OCP troubleshooting
See the following information to troubleshoot Red Hat OpenShift Container Platform (OCP) issues. Note that these troubleshooting topics apply only to the OCP version of Agile Service Manager, and not to the on-prem versions.
Halt a Cassandra node for maintenance
Sometimes you need to manually maintain your Cassandra server nodes. For example, you might need to halt one node for long enough to allow you to complete any required maintenance work or debug a problem.
- To reduce the impact of pausing a Cassandra node on the performance of your remaining system resources, keep such maintenance periods to a minimum.
- The commands used here assume that the node being halted is noi-cassandra-0.
- This troubleshooting topic applies only to Agile Service Manager on OCP (with NOI).
- Halt the node
- Edit the bootstrap configMap:
oc edit configmaps noi-cassandra-bootstrap-config
For example, if you want to halt node 0, set the following values:
hostname_filter: noi-cassandra-0
running_mode: maintenance
- Save and exit.
- Delete the pod so that the statefulSet restarts it automatically; the restarted pod is then halted:
oc delete pod noi-cassandra-0
- Debug the Cassandra server
- After the node is restarted and halted, its state stays as 0/1 Running. When the pod is in this mode, the Cassandra server is not automatically started in it, and the 'liveness' and 'readiness' probes are not activated for many days, allowing you to perform any maintenance that is required.
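For example, you can confirm the halted state and open a shell in the pod to carry out your maintenance work (the output shown here is illustrative):
oc get pod noi-cassandra-0
NAME              READY   STATUS    RESTARTS   AGE
noi-cassandra-0   0/1     Running   0          2m
oc exec -it noi-cassandra-0 -- bash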
- Return the node to normal
- After you complete your manual work, edit the configMap again and revert the hostname_filter and running_mode entries to the values they had before maintenance:
oc edit configmaps noi-cassandra-bootstrap-config
- Restart normal operations
- Finally, delete the pod again; the statefulSet recreates it as a normal node, and it continues to work in the cluster.
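This reuses the same delete command from the halt step:
oc delete pod noi-cassandra-0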
Network discovery errors
Network discovery can fail with errors such as the following:
'cannot resolve bootstrap urls'
'Sherpa service nginx gateway timeout'
These errors typically indicate a DNS problem in the cluster, which you can resolve by restarting the DNS pods:
- Check the state of the pods in the openshift-dns namespace:
kubectl get pods --namespace=openshift-dns
- Delete the DNS pods so that they are recreated:
kubectl delete --all pods --namespace=openshift-dns
- Verify that the DNS pods are running again:
kubectl get pods --namespace=openshift-dns
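When DNS is healthy again, the final query returns output similar to the following (pod names and counts vary by cluster; this sample is illustrative):
NAME                READY   STATUS    RESTARTS   AGE
dns-default-7kx2m   2/2     Running   0          40s
dns-default-9rzqp   2/2     Running   0          38s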
Services not binding to storage (after upgrade or uninstall)
Some services fail to bind to the provisioned storage, typically resulting in pods stuck in 'pending' state.
After removing a previous installation of Agile Service Manager and some of its PersistentVolumeClaim (PVC) objects, any associated PersistentVolume (PV) objects are placed in a 'Released' state. They are then unavailable for binding, even if new PVCs that are part of a new Agile Service Manager installation have the same name and namespace. This is an important security feature that safeguards the previous PV data.
The following (truncated) example shows Elasticsearch pods that cannot start because their storage is not bound:
$ kubectl get pod -l app=elasticsearch
NAME READY STATUS RESTARTS AGE
asm-elasticsearch-0 0/1 ContainerCreating 0 4s
asm-elasticsearch-1 0/1 Pending 0 3s
asm-elasticsearch-2 0/1 Pending 0 3s
This example examines the state of the PersistentVolumeClaims, and the (truncated) result indicates that the status is 'pending'.
$ kubectl get pvc -l app=elasticsearch
NAME STATUS VOLUME
data-asm-elasticsearch-0 Bound asm-data-elasticsearch-0
data-asm-elasticsearch-1 Pending
data-asm-elasticsearch-2 Pending
This example examines the PersistentVolumes, and the (truncated) result indicates that the status is 'released'.
$ kubectl get pv -l app=elasticsearch
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS
asm-data-elasticsearch-0 75Gi RWO Retain Bound
asm-data-elasticsearch-1 75Gi RWO Retain Released
asm-data-elasticsearch-2 75Gi RWO Retain Released
To make a released PV available again, you must remove the PV.Spec.ClaimRef.UID field from the PV object. The following (truncated) example shows a PV that is bound to a specific PVC:
apiVersion: v1
kind: PersistentVolume
spec:
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: data-asm-elasticsearch-1
    namespace: default
    resourceVersion: "81033"
    uid: 3dc73022-bb1d-11e8-997a-00000a330243
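You can remove the uid field either by interactively editing the PV (kubectl edit pv) or with a JSON patch. The following is a sketch of the patch approach, using the PV name from the example:
$ kubectl patch pv asm-data-elasticsearch-1 --type json -p '[{"op": "remove", "path": "/spec/claimRef/uid"}]'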
After you remove the uid field, the PV status changes to 'Available', as shown in the following example:
$ kubectl get pv -l app=elasticsearch
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS
asm-data-elasticsearch-0 75Gi RWO Retain Bound
asm-data-elasticsearch-1 75Gi RWO Retain Available
asm-data-elasticsearch-2 75Gi RWO Retain Available
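Once the PVs are 'Available', the pending PVCs of the new installation can bind to them. You can confirm this by rerunning the earlier PVC query; the output below is illustrative:
$ kubectl get pvc -l app=elasticsearch
NAME                       STATUS   VOLUME
data-asm-elasticsearch-0   Bound    asm-data-elasticsearch-0
data-asm-elasticsearch-1   Bound    asm-data-elasticsearch-1
data-asm-elasticsearch-2   Bound    asm-data-elasticsearch-2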
User interface timeout errors
If the Agile Service Manager user interface reports timeout errors, you can increase the values of the following timeout parameters:
- topologyServiceTimeout
- searchServiceTimeout
- layoutServiceTimeout
To change these values, edit the UI configMap:
kubectl edit configmap {{ .Release.Name }}-asm-ui-config
When done, restart the NOI WebGUI pod so that the new values take effect.
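A minimal sketch of what the edited configMap data might look like; the values and units shown here are assumptions, so match the format of the existing entries in your configMap:
topologyServiceTimeout: "120000"   # assumption: milliseconds
searchServiceTimeout: "120000"     # assumption: milliseconds
layoutServiceTimeout: "120000"     # assumption: milliseconds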