OCP troubleshooting

See the following information to troubleshoot Red Hat OpenShift Container Platform issues. Note that these troubleshooting topics apply only to the OCP version of Agile Service Manager, and not to the on-prem versions.

Halt a Cassandra node for maintenance

Sometimes you need to manually maintain your Cassandra server nodes. For example, you might need to halt one node for long enough to allow you to complete any required maintenance work or debug a problem.

The following steps suspend (halt) one Cassandra node while the other nodes still serve queries, which allows you to perform maintenance tasks without shutting down your production environment.
Note:
  • To reduce the impact of pausing a Cassandra node on the performance of your remaining system resources, keep such maintenance periods to a minimum.
  • The commands used here assume that the node being halted is noi-cassandra-0.
  • This troubleshooting topic applies only to Agile Service Manager on OCP (with NOI).
Halt the node
  1. Edit the bootstrap configMap.
    oc edit configmaps noi-cassandra-bootstrap-config
    
    For example, to halt node 0, set the following values (a scripted alternative is shown after this procedure).
    hostname_filter: noi-cassandra-0
    running_mode: maintenance
    
  2. Save and exit.
  3. To restart the target node pod, delete it. The statefulSet recreates the pod automatically, and the new pod starts in the halted state.
    oc delete pod noi-cassandra-0
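As an alternative to editing the configMap interactively, you can script the change. The following is a minimal sketch, assuming that hostname_filter and running_mode are top-level keys under the configMap's data section:
oc patch configmap noi-cassandra-bootstrap-config --type merge -p '{"data":{"hostname_filter":"noi-cassandra-0","running_mode":"maintenance"}}'
oc delete pod noi-cassandra-0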
Debug the Cassandra server
After the pod is restarted in maintenance mode, its state stays at 0/1 Running. In this mode, the Cassandra server is not started automatically, and the liveness and readiness probes do not trigger for many days, allowing you to perform any maintenance that is required.
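For example, you can confirm the halted state with a quick status check (the output shown here is illustrative):
oc get pod noi-cassandra-0

NAME              READY   STATUS    RESTARTS   AGE
noi-cassandra-0   0/1     Running   0          2m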
There are two methods to start the Cassandra server for debug purposes.
  • You can go into the pod container to manually start the server.
    oc exec -it noi-cassandra-0 -- bash
    bash-4.4$ /opt/ibm/cassandra/bin/cassandra -fR > /tmp/server.log 2>&1 &
    
    The server process starts in the background, and its log messages are written to /tmp/server.log.
  • You can reuse the entry point script.
    oc exec -it noi-cassandra-0 -- bash
    bash-4.4$ export RUNNING_MODE=normal
    bash-4.4$ /opt/ibm/start-cassandra.sh > /tmp/server.log 2>&1 &
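
Whichever method you use, you can follow the startup progress in /tmp/server.log from inside the pod and, once the server is up, check its state with nodetool (used without a path elsewhere in this topic, so it is assumed to be on the container's PATH):
bash-4.4$ tail -f /tmp/server.log
bash-4.4$ nodetool status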
    
Example maintenance task:
Remove corrupt commit logs that are preventing Cassandra from starting.
$ oc exec -it noi-cassandra-0 -- bash
bash-4.4$ rm -rf /opt/ibm/cassandra/data/commitlog/CommitLog-6-1682019553495.log
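The commit log file name shown here is only an example; in your environment, list the commit log directory (the same path as in the removal command) to identify the files to remove:
bash-4.4$ ls -l /opt/ibm/cassandra/data/commitlog/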
After performing the required maintenance, you can stop the manually started Cassandra server.
oc exec -it noi-cassandra-0 -- bash
bash-4.4$ nodetool stopdaemon
Note: You can also perform other actions, such as fixing volumes or transforming the data files. While maintenance mode is set, the pod is halted again each time it restarts.
Return the node to normal
After you complete your manual work, edit the configMap and restart the pod to return it to normal.
oc edit configmaps noi-cassandra-bootstrap-config
Set the running mode to normal.
hostname_filter: noi-cassandra-0
running_mode: normal
Restart normal operations

Finally, delete the pod again so that it reverts to normal operation and continues to work in the cluster.
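For example, assuming the halted node is noi-cassandra-0 as in the rest of this topic, the final steps look like this (the status output is illustrative):
oc delete pod noi-cassandra-0
oc get pod noi-cassandra-0

NAME              READY   STATUS    RESTARTS   AGE
noi-cassandra-0   1/1     Running   0          3m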

Network discovery errors:
'cannot resolve bootstrap urls'
'Sherpa service nginx gateway timeout'

When these errors occur, restart the DNS pods using the following commands:
  1. List the DNS pods.
    kubectl get pods --namespace=openshift-dns
  2. Delete all the DNS pods so that they are recreated.
    kubectl delete --all pods --namespace=openshift-dns
  3. List the DNS pods again and ensure that they are all up and running.
    kubectl get pods --namespace=openshift-dns
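If you want an additional check that cluster DNS is resolving again, you can run a one-off lookup from a temporary pod. This sketch assumes that a busybox image can be pulled in your environment:
kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup kubernetes.default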

Services not binding to storage (after upgrade or uninstall)

Some services fail to bind to the provisioned storage, typically resulting in pods stuck in 'pending' state.

After removing a previous installation of Agile Service Manager and some of its PersistentVolumeClaim (PVC) objects, any associated PersistentVolume (PV) objects are placed in a 'Released' state. They are then unavailable for binding, even if new PVCs that are part of a new Agile Service Manager installation have the same name and namespace. This is an important security feature that safeguards the previous PV data.

Investigating the problem: The following example lists the 'elasticsearch' pods and their status, and the result shows the 'pending' status, indicating the problem.
$ kubectl get pod -l app=elasticsearch

NAME                      READY  STATUS             RESTARTS  AGE
asm-elasticsearch-0       0/1    ContainerCreating  0         4s
asm-elasticsearch-1       0/1    Pending            0         3s
asm-elasticsearch-2       0/1    Pending            0         3s
This example examines the state of the PersistentVolumeClaims and the (truncated) result indicates that the status is 'pending'.
$ kubectl get pvc -l app=elasticsearch

NAME                       STATUS    VOLUME
data-asm-elasticsearch-0   Bound     asm-data-elasticsearch-0
data-asm-elasticsearch-1   Pending
data-asm-elasticsearch-2   Pending
This example examines the PersistentVolumes and the (truncated) result indicates that the status is 'released'.
$ kubectl get pv -l app=elasticsearch

NAME                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS
asm-data-elasticsearch-0   75Gi       RWO            Retain           Bound
asm-data-elasticsearch-1   75Gi       RWO            Retain           Released
asm-data-elasticsearch-2   75Gi       RWO            Retain           Released
Solution: As the admin user, remove the PV.Spec.ClaimRef.UID field from the PV objects to make the PVs available again. The following (truncated) example shows a PV that is bound to a specific PVC:
apiVersion: v1
kind: PersistentVolume
spec:
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: data-asm-elasticsearch-1
    namespace: default
    resourceVersion: "81033"
    uid: 3dc73022-bb1d-11e8-997a-00000a330243
To solve the problem, edit the PV object and remove the uid field, after which the PV status changes to 'Available', as shown in the following example:
$ kubectl get pv -l app=elasticsearch

NAME                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS
asm-data-elasticsearch-0   75Gi       RWO            Retain           Bound
asm-data-elasticsearch-1   75Gi       RWO            Retain           Available
asm-data-elasticsearch-2   75Gi       RWO            Retain           Available
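If you prefer to script the change rather than use kubectl edit, a JSON patch that removes the uid field should have the same effect; the PV name here is the one from the example above:
kubectl patch pv asm-data-elasticsearch-1 --type json -p '[{"op": "remove", "path": "/spec/claimRef/uid"}]'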

User interface timeout errors

To prevent or mitigate UI timeout errors, you can increase the timeout values for the following parameters, which are defined in the configmap:
  • topologyServiceTimeout
  • searchServiceTimeout
  • layoutServiceTimeout
To change these timeout values (in seconds), edit the configmap using the following command:
kubectl edit configmap {{ .Release.Name }}-asm-ui-config
When done, restart the NOI webgui pod.
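The exact name of the webgui pod depends on your deployment; one way to restart it is to find the pod and delete it so that it is recreated. The grep pattern and placeholder name below are assumptions:
kubectl get pods | grep -i webgui
kubectl delete pod <webgui-pod-name>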