Pod Issues
- A GUI pod is stuck in 3/4 running containers with multiple restarts
- Error: daemon and kernel extension do not match
- Error: could not insert module, required key not available
- MountVolume.SetUp failed for volume "ssh-keys"
- Unable to retrieve some image pull secrets
- GUI or Grafana bridge pods fail to start, no data returned from pmcollector to front-end applications
- pmcollector pod is in pending state during OpenShift Container Platform upgrade or reboot
- pmsensors shows null after failure of pmcollector node
- pmcollector pods are not in Running state
- pids_limit set higher than podPidsLimit, but not being honored
A GUI pod is stuck in 3/4 running containers with multiple restarts
During an upgrade or a rolling pod update, the Liberty container in a GUI pod has multiple restarts but does not recover.
kubectl get pod -n ibm-spectrum-scale | grep gui
NAME READY STATUS RESTARTS AGE
ibm-spectrum-scale-gui-0 3/4 Running 705 (4m18s ago) 3d
ibm-spectrum-scale-gui-1 4/4 Running 1 (17h ago) 2d16h
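Optionally, before deleting the pod, you can check which container is failing and why. The following commands are a minimal sketch and not part of the documented procedure; they assume the pod name shown above and the liberty container name used later in this section:
# Show readiness and restart counts per container in the stuck pod
kubectl get pod ibm-spectrum-scale-gui-0 -n ibm-spectrum-scale \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.restartCount}{"\n"}{end}'
# Review the previous liberty container instance for the crash reason
kubectl logs ibm-spectrum-scale-gui-0 -n ibm-spectrum-scale -c liberty --previous | tail -50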
Manually delete the pod to fix the issue:
- Identify the problematic GUI pod:
kubectl get pod -n ibm-spectrum-scale | grep gui
ibm-spectrum-scale-gui-0 3/4 Running 705 (4m18s ago) 3d
- Delete the GUI pod:
kubectl delete pod -n ibm-spectrum-scale <gui_pod_name>
Example output:
kubectl delete pod -n ibm-spectrum-scale ibm-spectrum-scale-gui-0
pod "ibm-spectrum-scale-gui-0" deleted
- Verify that the GUI pod recovers with 4/4 containers in READY.
kubectl get pod -n ibm-spectrum-scale | grep gui
ibm-spectrum-scale-gui-0 4/4 Running 0 4m
ibm-spectrum-scale-gui-1 4/4 Running 0 16m
- Verify that the GUI pods are at the correct level.
for pod in `kubectl get pods -lapp.kubernetes.io/name=gui -n ibm-spectrum-scale -ojson | jq -r .items[].metadata.name`; do
  echo -e "\n==== $pod ===="
  kubectl logs $pod -n ibm-spectrum-scale -c liberty | grep 'GPFS GUI'
done
Example output:
==== ibm-spectrum-scale-gui-0 ====
GPFS GUI Version:5.2.3-0
GPFS GUI Build Date: 20241119-1908
==== ibm-spectrum-scale-gui-1 ====
GPFS GUI Version:5.2.3-0
GPFS GUI Build Date: 20241119-1908
Error: daemon and kernel extension do not match
This error occurs when an unintentional upgrade of IBM Storage Scale container native happens, and the issue presents itself as the GPFS state being down. The following error can be found in /var/adm/ras/mmfs.log.latest on the core pods.
Error: daemon and kernel extension do not match
To prevent this issue, follow proper upgrade procedures as the kernel module cannot be unloaded when a file system is in use.
To resolve this problem, restart the node or follow procedures to remove application workloads and use the following command to unload the kernel module. For more information, see Removing applications.
rmmod tracedev mmfs26 mmfslinux
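As a sanity check, which is not part of the documented procedure, you can confirm whether the GPFS kernel modules are still loaded on the affected node before and after running rmmod. This sketch assumes you can open a node debug shell; substitute your node name for the placeholder:
# List any GPFS kernel modules still loaded on the node
oc debug node/<node-name> -- chroot /host sh -c 'lsmod | grep -E "mmfs26|mmfslinux|tracedev"'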
Error: could not insert module, required key not available
This error occurs when secure boot is enabled on Red Hat OpenShift nodes. The issue presents itself as the GPFS state being down, which can be verified by entering the following command:
kubectl exec -n ibm-spectrum-scale \
$(kubectl get pods -lapp.kubernetes.io/name=core -n ibm-spectrum-scale -ojsonpath="{.items[0].metadata.name}") \
-- mmgetstate -a
The output appears as shown:
Node number Node name GPFS state
-------------------------------------------
1 worker0 active
2 worker1 down
3 worker2 active
The following error can be found in /var/adm/ras/mmfs.log.latest on the core pods that have the GPFS state down.
ERROR: could not insert module /lib/modules/4.18.0-372.53.1.el8_6.x86_64/extra/tracedev.ko: Required key not available
To verify the secure boot state and resolve the problem, see Validate secure boot.
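For a quick check of the secure boot state directly on the node, the following sketch can be used. It assumes node debug access and is not a replacement for the Validate secure boot steps; substitute the node that reports GPFS state down:
# Report whether secure boot is enabled on the node
oc debug node/worker1 -- chroot /host mokutil --sb-state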
MountVolume.SetUp failed for volume ssh-keys
Warning FailedMount 83m (x5 over 83m) kubelet, worker-0.example.ibm.com MountVolume.SetUp failed for volume "ssh-keys" : secret "ibm-spectrum-scale-ssh-key-secret" not found
Check the pod creation and secret creation times. It is common for the ssh-key-secret to be created after the deployment of the core pods, so the pods cannot find the secret because it does not exist yet. The message can be misleading, as it takes time for the operator to create all the resources. This error is transient and resolves itself after the secret is present.
If the core pods are not in the Running state and the secret has been created and present for some time, deleting the core pods resolves the issue. This action causes the pods to be re-created and the secret to be mounted successfully.
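To confirm whether this is the transient case, you can compare the creation timestamps of the secret and the core pods. This is a minimal sketch, assuming the secret name from the event above and the core pod label used elsewhere in this document:
# If the secret exists and is older than the pods, the mount warning clears on its own
kubectl get secret ibm-spectrum-scale-ssh-key-secret -n ibm-spectrum-scale \
  -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
kubectl get pods -lapp.kubernetes.io/name=core -n ibm-spectrum-scale \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'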
Unable to retrieve some image pull secrets
Starting with Red Hat OpenShift 4.15 (Kubernetes 1.28), warning messages appear when secrets are referenced in a service account but are not created in the namespace. Configuring the ICR entitlement using the Red Hat OpenShift global pull secret results in these messages appearing in the events.
"Unable to retrieve some image pull secrets (...); attempting to pull the image may not succeed."
- If the warning is for ibm-entitlement-key: Resolve this warning by defining namespace pull secrets. For more information, see Namespace Pull Secrets.
- If the warning is for ibm-spectrum-scale-registrykey: This secret is deprecated in IBM Storage Scale container native v5.2.2.0. To remove the warning messages, the secret name needs to be removed from the service accounts.
Run the following script to clean up the service accounts:
for ns in ibm-spectrum-scale ibm-spectrum-scale-operator ibm-spectrum-scale-dns; do
  for sa in `kubectl get serviceaccount -n $ns -ojson | jq -r .items[].metadata.name | grep ibm-`; do
    kubectl get serviceaccount $sa -n $ns -o json | \
      jq '(.imagePullSecrets | map(select(.name != "ibm-spectrum-scale-registrykey"))) as $newSecrets | .imagePullSecrets = $newSecrets' | \
      kubectl apply -f -
  done
done
After namespace secrets are created, the pods need to restart for the warning messages to quiesce. For core pods only, use the following annotation to have the operator orchestrate the pod restarts while preserving quorum:
kubectl annotate pod -lapp.kubernetes.io/name=core scale.spectrum.ibm.com/pending=delete
For other pods, you must delete them manually.
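After running the cleanup script, you can verify that no service account still references the deprecated secret. This is a minimal check, assuming the same namespaces as in the script above:
# Should print nothing once the registrykey reference has been removed everywhere
for ns in ibm-spectrum-scale ibm-spectrum-scale-operator ibm-spectrum-scale-dns; do
  kubectl get serviceaccount -n $ns -ojson | \
    jq -r '.items[] | select(.imagePullSecrets[]?.name == "ibm-spectrum-scale-registrykey") | .metadata.name'
done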
GUI or Grafana bridge pods fail to start, no data returned from pmcollector to front-end applications
An issue exists where no data is returned to front-end applications that are actively consuming performance metrics from the IBM Storage Scale pmcollector. Another signature of this issue is the Grafana bridge pod failing to start. If this is experienced, apply the following workaround.
- Check the NodeNetworkConfigurationPolicies to determine which network interfaces are configured for the node network.
- List the NodeNetworkConfigurationPolicies:
kubectl get nnce
Example:
# kubectl get nnce
NAME                                               STATUS
compute-0.mycluster.example.com.bond1-ru5-policy   SuccessfullyConfigured
compute-1.mycluster.example.com.bond1-ru6-policy   SuccessfullyConfigured
compute-2.mycluster.example.com.bond1-ru7-policy   SuccessfullyConfigured
control-0.mycluster.example.com.bond1-ru2-policy   SuccessfullyConfigured
control-1.mycluster.example.com.bond1-ru3-policy   SuccessfullyConfigured
control-2.mycluster.example.com.bond1-ru4-policy   SuccessfullyConfigured
- Describe the NodeNetworkConfigurationPolicy to identify the network interface being used. Example:
# kubectl describe nnce compute-0.mycluster.example.com.bond1-ru5-policy | grep Name
Name:         compute-0.mycluster.example.com.bond1-ru5-policy
Namespace:
Name:         bond1-ru5-policy
Name:         bond1
Name:         bond1.3201
In this particular example, the bond interfaces are configured for the node network traffic.
- Change the Performance Data Collection rules to limit the discovery of the network adapters to only the configured interfaces.
- Stop the sensor activities on all Core nodes:
kubectl get pods -lapp.kubernetes.io/name=core -n ibm-spectrum-scale \
  -ojsonpath="{range .items[*]}{.metadata.name}{'\n'}" | \
  xargs -I{} kubectl exec {} -n ibm-spectrum-scale -c gpfs -- \
  kill $(pgrep -fx '/opt/IBM/zimon/sbin/pmsensors -E /opt/IBM/zimon -C /etc/scale-pmsensors-configuration/ZIMonSensors.cfg -R /var/run/perfmon')
- Review the current filter settings for the Network sensor in the Performance Data Collection rules. These are stored in the ibm-spectrum-scale-pmsensors-config configmap.
kubectl describe cm ibm-spectrum-scale-pmsensors-config -n ibm-spectrum-scale | grep filter | grep netdev
Example output:
# kubectl describe cm ibm-spectrum-scale-pmsensors-config -n ibm-spectrum-scale | grep filter | grep netdev
filter = "netdev_name=veth.*|docker.*|flannel.*|cali.*|cbr.*"
The filter = output is used for exclusion logic.
- Edit the ibm-spectrum-scale-pmsensors-config configmap with the following command:
kubectl edit cm ibm-spectrum-scale-pmsensors-config -n ibm-spectrum-scale
Replace the substring netdev_name=veth.*|docker.*|flannel.*|cali.*|cbr.* with netdev_name=^((?!bond).)*
The bond interface is used in this example. Replace bond with the adapter name that is used by your network interface.
- Verify that the ibm-spectrum-scale-pmsensors-config configmap now reflects the wanted adapter.
kubectl describe cm ibm-spectrum-scale-pmsensors-config -n ibm-spectrum-scale | grep filter | grep netdev
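For the bond example above, the output would resemble the following; this is illustrative only, and your adapter name will differ:
# kubectl describe cm ibm-spectrum-scale-pmsensors-config -n ibm-spectrum-scale | grep filter | grep netdev
filter = "netdev_name=^((?!bond).)*"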
- Clean up the metadata keys in the pmcollector database that are not related to the configured node network interfaces. Remote shell into each pmcollector pod and issue the following commands.
kubectl -n ibm-spectrum-scale exec -c pmcollector -it \
  $(kubectl get pods -lapp.kubernetes.io/name=pmcollector -o jsonpath='{.items[0].metadata.name}') -- sh
echo "delete key .*|Network|[a-f0-9]{15}|.*" | /opt/IBM/zimon/zc 0
echo "topo -c -d 6" | /opt/IBM/zimon/zc 0| grep Network | cut -d'|' -f2-3 | sort | uniq -c | sort -n | tail -50
Then, exit the container.
Example:
# kubectl -n ibm-spectrum-scale exec -c pmcollector -it \
    $(kubectl get pods -lapp.kubernetes.io/name=pmcollector -o jsonpath='{.items[0].metadata.name}') -- sh
sh-4.4$ echo "delete key .*|Network|[a-f0-9]{15}|.*" | /opt/IBM/zimon/zc 0
sh-4.4$ echo "topo -c -d 6" | /opt/IBM/zimon/zc 0| grep Network | cut -d'|' -f2-3 | sort | uniq -c | sort -n | tail -50
     96 Network|bond0
     96 Network|bond1
     96 Network|bond1.3201
     96 Network|lo
sh-4.4$ exit
- Start the sensors on all Core nodes:
kubectl get pods -lapp.kubernetes.io/name=core -n ibm-spectrum-scale \
  -ojsonpath="{range .items[*]}{.metadata.name}{'\n'}" | \
  xargs -I{} kubectl exec {} -n ibm-spectrum-scale -c gpfs -- \
  /opt/IBM/zimon/sbin/pmsensors -E /opt/IBM/zimon -C /etc/scale-pmsensors-configuration/ZIMonSensors.cfg -R /var/run/perfmon
- Delete the pmcollector and Grafana bridge pods to pick up the configuration changes.
kubectl delete pod -lapp.kubernetes.io/instance=ibm-spectrum-scale,app.kubernetes.io/name=pmcollector
kubectl delete pod -lapp.kubernetes.io/instance=ibm-spectrum-scale,app.kubernetes.io/name=grafanabridge
After some time, the pmcollector and Grafana bridge pods are redeployed by the ibm-spectrum-scale-operator.
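To confirm that metrics are flowing again, you can reuse the query shown later in this document against one of the new pmcollector pods; non-null values indicate that collection has resumed. This is a sketch, not part of the documented workaround:
# Query recent cpu_user samples from inside a pmcollector pod
kubectl -n ibm-spectrum-scale exec -c pmcollector \
  $(kubectl get pods -n ibm-spectrum-scale -lapp.kubernetes.io/name=pmcollector -o jsonpath='{.items[0].metadata.name}') -- \
  sh -c 'echo "get metrics cpu_user bucket_size 5 last 10" | /opt/IBM/zimon/zc 0'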
pmcollector pod is in pending state during OpenShift Container Platform upgrade or reboot
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 65s (x202 over 4h43m) default-scheduler 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master:}, that the pod didn't tolerate.
This issue is caused by a problem during an OpenShift Container Platform upgrade or when a worker node has not been reset to schedulable after a reboot. The pmcollector remains in a Pending state until the pod itself and its respective Persistent Volume can be bound to a worker node.
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master0.example.com Ready master 5d18h v1.18.3+2fbd7c7
master1.example.com Ready master 5d18h v1.18.3+2fbd7c7
master2.example.com Ready master 5d18h v1.18.3+2fbd7c7
worker0.example.com Ready worker 5d18h v1.17.1+45f8ddb
worker1.example.com Ready,SchedulingDisabled worker 5d18h v1.17.1+45f8ddb
worker2.example.com Ready worker 5d18h v1.17.1+45f8ddb
If the Persistent Volume has Node Affinity to the host that has SchedulingDisabled, the pmcollector pod remains in Pending state until the node associated with the PV becomes schedulable.
# kubectl describe pv worker1.example.com-pv
Name: worker1.example.com-pv
Labels: app=scale-pmcollector
Annotations: pv.kubernetes.io/bound-by-controller: yes
Finalizers: [kubernetes.io/pv-protection]
StorageClass: ibm-spectrum-scale-internal
Status: Bound
Claim: example/datadir-ibm-spectrum-scale-pmcollector-1
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 25Gi
Node Affinity:
Required Terms:
Term 0: kubernetes.io/hostname in [worker1.example.com]
Message:
Source:
Type: LocalVolume (a persistent volume backed by local storage on a node)
Path: /var/mmfs/pmcollector
If the issue was with an OpenShift Container Platform upgrade, fixing the upgrade issue should resolve the pending pod.
If the issue is due to a worker node in SchedulingDisabled state and not due to a failed OpenShift Container Platform upgrade, re-enable scheduling for the worker with the oc adm uncordon command.
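For example, to make worker1.example.com from the listing above schedulable again (substitute your node name):
oc adm uncordon worker1.example.com
# Confirm the node no longer shows SchedulingDisabled
kubectl get node worker1.example.com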
pmsensors shows null after failure of pmcollector node
If a node that is running the pmcollector pod is drained, the pmcollector pods get new IPs assigned when the node is uncordoned. This leads to an issue with the pmsensors process, which displays the following message:
Connection to scale-pmcollector-0.scale-pmcollector successfully established.
But an error is reported:
Error on socket to scale-pmcollector-0.scale-pmcollector: No route to host (113)
See /var/log/zimon/ZIMonSensors.log. This issue can also be seen on the pmcollector pod:
# echo "get metrics cpu_user bucket_size 5 last 10" | /opt/IBM/zimon/zc 0
1: worker1
2: worker2
Row Timestamp cpu_user
1 2020-11-16 05:27:25 null
2 2020-11-16 05:27:30 null
3 2020-11-16 05:27:35 null
4 2020-11-16 05:27:40 null
5 2020-11-16 05:27:45 null
6 2020-11-16 05:27:50 null
7 2020-11-16 05:27:55 null
8 2020-11-16 05:28:00 null
9 2020-11-16 05:28:05 null
10 2020-11-16 05:28:10 null
If the scale-pmcollector pods' IP addresses changed, the pmsensors process needs to be stopped and restarted manually on all scale-core pods so that performance metrics collection resumes.
To stop the pmsensors process, run these commands on all the ibm-spectrum-scale-core pods. The PMSENSORPID variable holds the result of the kubectl exec command. If this variable is empty, no process is running and you do not need to enter the following command to stop the process.
PMSENSORPID=`kubectl exec <ibm-spectrum-scale-core> -n ibm-spectrum-scale -- pgrep -fx '/opt/IBM/zimon/sbin/pmsensors -C /etc/scale-pmsensors-configuration/ZIMonSensors.cfg -R /var/run/perfmon'`
echo $PMSENSORPID
kubectl exec <scale-pod> -n ibm-spectrum-scale -- kill $PMSENSORPID
To start the service again, enter this command on all the scale pods.
kubectl exec <scale-pod> -n ibm-spectrum-scale -- /opt/IBM/zimon/sbin/pmsensors -C /etc/scale-pmsensors-configuration/ZIMonSensors.cfg -R /var/run/perfmon
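If you prefer not to run the stop and start commands pod by pod, a simple loop can cover all core pods. This is a minimal sketch and not the documented procedure; it assumes the core pods carry the app.kubernetes.io/name=core label and the gpfs container name used elsewhere in this document:
for pod in $(kubectl get pods -lapp.kubernetes.io/name=core -n ibm-spectrum-scale -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'); do
  # Stop any running pmsensors process in the pod (ignore the error if none is running)
  kubectl exec $pod -n ibm-spectrum-scale -c gpfs -- pkill -fx '/opt/IBM/zimon/sbin/pmsensors -C /etc/scale-pmsensors-configuration/ZIMonSensors.cfg -R /var/run/perfmon' || true
  # Start pmsensors again with the same configuration
  kubectl exec $pod -n ibm-spectrum-scale -c gpfs -- /opt/IBM/zimon/sbin/pmsensors -C /etc/scale-pmsensors-configuration/ZIMonSensors.cfg -R /var/run/perfmon
done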
pmcollector pods are not in Running state
Reasons that the pmcollector pods might need to be fixed:
- Pods cannot be rescheduled on previously scheduled nodes because the pmcollector nodes were deleted or are unrecoverable.
- The pmcollector pods are not updated after updates to the pmcollector's statefulset toleration.
- Stop the pmcollector pods by scaling the statefulset down to 0:
kubectl scale statefulset.apps/ibm-spectrum-scale-pmcollector -n ibm-spectrum-scale --replicas=0
- Delete the pmcollector's persistent volume claims and persistent volumes:
kubectl delete pvc -lapp.kubernetes.io/instance=ibm-spectrum-scale,app.kubernetes.io/name=pmcollector -n ibm-spectrum-scale
kubectl delete pv -lapp.kubernetes.io/instance=ibm-spectrum-scale,app.kubernetes.io/name=pmcollector -n ibm-spectrum-scale
- Start the pmcollector pods by scaling the statefulset up to 2:
kubectl scale statefulset.apps/ibm-spectrum-scale-pmcollector -n ibm-spectrum-scale --replicas=2
- Stop the IBM Storage Scale container native operator:
kubectl scale deployment ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator --replicas=0
- Start the IBM Storage Scale container native operator:
kubectl scale deployment ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator --replicas=1
- Verify that the pmcollector pods restarted successfully:
- Verify that the operator created the persistent volumes and that they are Bound.
kubectl get pv | grep pmcollector
- Verify that the persistent volume claims are Bound.
kubectl get pvc -o wide -n ibm-spectrum-scale | grep pmcollector
- Verify that the pmcollector pods are running all containers (2/2) on the intended nodes.
kubectl get pods -o wide -n ibm-spectrum-scale | grep pmcollector
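A healthy result might resemble the following; this is for illustration only, and pod names, ages, and nodes will differ in your environment:
# kubectl get pods -o wide -n ibm-spectrum-scale | grep pmcollector
ibm-spectrum-scale-pmcollector-0   2/2   Running   0   5m   <pod-ip>   worker0.example.com   <none>   <none>
ibm-spectrum-scale-pmcollector-1   2/2   Running   0   4m   <pod-ip>   worker2.example.com   <none>   <none>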
pids_limit set higher than podPidsLimit, but not being honored
With Red Hat OpenShift Container Platform 4.11, certain CRI-O fields that were introduced before the support existed in kubelet have been deprecated. One of those deprecated fields is pids_limit, which was configured in the ContainerRuntimeConfig CR. For more information, see CRI-O should deprecate log size max and pids limit options.
If you had applied a custom MCO configuration with a pids_limit value higher than 4096, the container limits are restricted by the default podPidsLimit value in kubelet.conf. This default is set to 4096 on OpenShift Container Platform 4.11 and later. To increase this value, do the following:
It is highly recommended that you are at IBM Storage Scale container native v5.1.5 or higher before making changes to MachineConfig, as the IBM Storage Scale container native operator orchestrates the updates to MachineConfig in an attempt to keep the IBM Storage Scale cluster operational.
- Define the podPidsLimit in the KubeletConfig custom resource.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: 01-worker-ibm-spectrum-scale-increase-pid-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ''
  kubeletConfig:
    podPidsLimit: 8192
- Delete the IBM Storage Scale container native ContainerRuntimeConfig resource in order to set the default value for the container runtime to 0, which is effectively unlimited:
kubectl delete ContainerRuntimeConfig 01-worker-ibm-spectrum-scale-increase-pid-limit
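To confirm that the new limit is active after the MachineConfig rollout completes, you can check the rendered kubelet configuration on a worker node. This is a sketch, assuming node debug access and that the kubelet configuration is rendered to /etc/kubernetes/kubelet.conf on the node; substitute your node name:
# The rendered kubelet configuration should show the increased podPidsLimit (8192 in this example)
oc debug node/<node-name> -- chroot /host grep -i podPidsLimit /etc/kubernetes/kubelet.conf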