Global Data Platform service issues
This section lists troubleshooting tips for issues that you might encounter when you use IBM Fusion storage.
Warning: Do NOT delete IBM Storage Scale pods. Deleting Scale pods can, in many circumstances, affect availability and data integrity.
Filesystem status from Remote file systems is stuck in Connecting state
- Cause
- Secure boot is not supported by IBM Storage Scale and CNSA.
- Resolution
-
  - Check whether the nodes are running with secure boot (a cluster-level check sketch follows this list):
    mokutil --sb-state
    SecureBoot enabled
  - As the root user, you can disable secure boot with the following command:
    mokutil --disable-validation
    password length: 8~16
    input password:
    input password again:
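To check the secure boot state of a node from the cluster rather than from each server console, a command like the following one can be used. The node name is a placeholder, and this assumes that mokutil is available on the node image:
  # Check the secure boot state on a specific OpenShift node (replace <node-name>).
  oc debug node/<node-name> -- chroot /host mokutil --sb-state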
Increased CPU usage on Scale pods after upgrading to Cloud Pak for Integration (CP4I)
- Problem statement
- Scale pod CPU usage increases after the upgrade because of changes in the MQ streams application, specifically IBM MQ's liveness and readiness probes.
- Resolution
- Turn off the file handle reservation by setting the environment variable AMQ_NO_RESERVE_FD=1. This setting fixes the high CPU usage.
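One way to set the variable, as a sketch: the deployment name and namespace are placeholders, and if the queue manager is managed by the IBM MQ operator, set the variable through the QueueManager resource instead so that the operator does not revert the change:
  # Set AMQ_NO_RESERVE_FD=1 on the MQ workload (names are placeholders).
  oc set env deployment/<mq-queue-manager-deployment> AMQ_NO_RESERVE_FD=1 -n <mq-namespace>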
Scale operator failed to add quorum label
- Problem statement
- During Storage configuration, the recovery group status can be "Waiting on some daemons in the recovery group to be restarted".
- Resolution
-
- Run the following command to check the count of quorum nodes:
  oc get nodes -l scale.spectrum.ibm.com/designation=quorum
- Check the number of quorum nodes. When the node count is less than 10, there must be 3 quorum nodes. If not, apply the following quorum label to the missing nodes (a labeling sketch follows this list):
  scale.spectrum.ibm.com/designation=quorum
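A minimal sketch of applying the quorum label to a node; the node name is a placeholder:
  # Label a node as a quorum node (replace <node-name>).
  oc label node <node-name> scale.spectrum.ibm.com/designation=quorum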
Remote filesystem connection goes to a Disconnected state
- Problem statement
- If even one Scale core pod is not ready because of a node restart or other issues, the file system mount on that node has a problem and the Filesystem CR reports errors. As a result, the file system goes to a Disconnected state and the same is displayed in the IBM Fusion user interface.
- Resolution
- Wait for the Scale core pod to return to the Ready state; the file system then automatically changes back to the Connected state.
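To watch the Scale core pods until they return to the Ready state, a command like the following one can be used; the ibm-spectrum-scale namespace is an assumption for this sketch:
  # Watch the IBM Storage Scale core pods until all containers report Ready.
  oc get pods -n ibm-spectrum-scale -w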
PVCs stuck in pending state
- Resolution
- If PVCs are stuck in a Pending state for a long time, restart the Scale GUI pods.
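A minimal sketch of restarting the GUI pods by deleting them so that they are re-created automatically; the pod names and namespace are assumptions:
  # Delete the GUI pods; they are re-created automatically.
  oc delete pod ibm-spectrum-scale-gui-0 ibm-spectrum-scale-gui-1 -n ibm-spectrum-scale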
Nodes with pods scheduled for storage result in CrashLoopBackOff or ContainerCreating errors
- Problem statement
- When install, upgrade, or upsize operations are in progress and the storage configuration is not yet done, the nodes with pods scheduled for storage result in CrashLoopBackOff or ContainerCreating errors.
- Resolution
- To resolve this issue, disable scheduling on these nodes (cordon them) before you begin these operations, and schedule the pods on nodes that are configured for storage.
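A minimal sketch of cordoning a node before the operation and uncordoning it afterwards; the node name is a placeholder:
  # Mark the node as unschedulable before install, upgrade, or upsize.
  oc adm cordon <node-name>
  # Re-enable scheduling after the storage configuration completes.
  oc adm uncordon <node-name>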
File system is not mounted on some nodes
- Problem statement
- After IBM Storage Scale upgrade, the file system is not mounted on some nodes.
- Resolution
- Do the following workaround steps:
  - Delete the problematic pod where the file system is not mounted. Delete one pod at a time.
  - Refresh the pod overview page until the pod appears. Run watch oc get pods.
  - When the problematic pod comes up, set the replicas to 0 in the Spec section in the ibm-spectrum-scale-operator namespace. You must stop the operator deployment immediately. Ensure that no operator pod is running in the ibm-spectrum-scale-operator namespace. Command for setting the operator replicas to 0:
    oc patch deploy ibm-spectrum-scale-controller-manager \
      --type='json' -n ibm-spectrum-scale-operator \
      -p='[{"op": "replace", "path": "/spec/replicas", "value": 0}]'
    Sample output of oc get deployment -n ibm-spectrum-scale-operator:
    I0420 16:56:23.579340 4184 request.go:645] Throttling request took 1.192496025s, request: GET:https://api.isf-rackk.rtp.raleigh.ibm.com:6443/apis/nmstate.io/v1beta1?timeout=32s
    NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
    ibm-spectrum-scale-controller-manager   0/0     0            0           28h
  - Wait till the problematic pod is in the init container state and run the following command:
    oc exec <pod name> -- mmsdrrestore -p <any working core pod IP or name where the file system is mounted>
  - After the previous step is successful, set the replicas to 1 again in the ibm-spectrum-scale-operator deployment. Ensure that the operator pod is running in the ibm-spectrum-scale-operator namespace. Run the following command to set the operator replicas to 1:
    oc patch deploy ibm-spectrum-scale-controller-manager \
      --type='json' -n ibm-spectrum-scale-operator \
      -p='[{"op": "replace", "path": "/spec/replicas", "value": 1}]'
    Sample output of oc get deployment -n ibm-spectrum-scale-operator:
    I0420 16:56:23.579340 4184 request.go:645] Throttling request took 1.192496025s, request: GET:https://api.isf-rackk.rtp.raleigh.ibm.com:6443/apis/nmstate.io/v1beta1?timeout=32s
    NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
    ibm-spectrum-scale-controller-manager   1/1     0            0           28h
  - Wait for some time for the problematic pod to come to the Running state; the file system then gets mounted on this problematic node. A verification sketch follows this list.
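To verify that the file system is mounted on all nodes after the pod recovers, you can run a command like the following one from a working core pod; the pod name and namespace are placeholders:
  # List the mount status of all GPFS file systems across the cluster nodes.
  oc exec <core-pod-name> -n ibm-spectrum-scale -- mmlsmount all -L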
IBM Storage Scale user interface does not load properly or IBM Storage Scale CSI pods are in CrashLoopBackOff state
- Problem statement
- Sometimes, the IBM Storage Scale user interface might not load properly, or the IBM Storage Scale CSI pods are in CrashLoopBackOff state.
- Resolution
- As a workaround, do the following steps:
  - Restart the GUI pods and check whether the user interface works properly.
  - If the GUI pod restart does not open the IBM Storage Scale console, restart the ibm-spectrum-scale-controller-manager pod and then restart the GUI pods (a restart sketch follows these steps).
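A minimal sketch of restarting the operator pod, assuming it runs as the ibm-spectrum-scale-controller-manager deployment in the ibm-spectrum-scale-operator namespace as shown in the earlier steps:
  # Restart the Scale operator pod by rolling the deployment.
  oc rollout restart deployment/ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator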
Not able to log in to IBM Storage Scale user interface
- Cause
- The secret that is mounted in the GUI pods in the /etc/gui-oauth/oauth.properties file and the secret that is present in the ibm-spectrum-scale-gui-oauthclient OAuthClient are different. For example, both GUI pods report one secret in oauth.properties, while the OAuthClient carries another:
  gui-1
  sh-4.4$ cd /etc/gui-oauth/
  sh-4.4$ cat oauth.properties
  secret=tjCK6kE0pmCe2d8b7azw
  oauthServer=https://oauth-openshift.apps.isf-racka.rtp.raleigh.ibm.com
  redirectURI=https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.isf-racka.rtp.raleigh.ibm.com/auth/redirect
  clientID=ibm-spectrum-scale-gui-oauthclient
  k8sAPIServer=https://172.30.0.1:443
  sh-4.4$

  gui_0
  secret=tjCK6kE0pmCe2d8b7azw
  oauthServer=https://oauth-openshift.apps.isf-racka.rtp.raleigh.ibm.com
  redirectURI=https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.isf-racka.rtp.raleigh.ibm.com/auth/redirect
  clientID=ibm-spectrum-scale-gui-oauthclient
  k8sAPIServer=https://172.30.0.1:443

  OAuth client configuration (API Explorer -> OAuthClient -> ibm-spectrum-scale-gui-oauthclient):
  kind: OAuthClient
  apiVersion: oauth.openshift.io/v1
  metadata:
    name: ibm-spectrum-scale-gui-oauthclient
    uid: a9eb9d85-33b9-4cd7-99c8-ae0213032380
    resourceVersion: '314932'
    creationTimestamp: '2022-03-29T17:09:38Z'
    labels:
      app.kubernetes.io/instance: ibm-spectrum-scale
      app.kubernetes.io/name: gui
    ownerReferences:
      - apiVersion: scale.spectrum.ibm.com/v1beta1
        kind: Gui
        name: ibm-spectrum-scale-gui
        uid: da23c696-5c71-405b-b9ac-c7d0ccb4dd43
        controller: true
        blockOwnerDeletion: true
    managedFields:
      - manager: ibm-spectrum-scale/ibm-spectrum-scale-gui
        operation: Apply
        apiVersion: oauth.openshift.io/v1
        time: '2022-03-29T17:09:38Z'
        fieldsType: FieldsV1
        fieldsV1:
          'f:grantMethod': {}
          'f:metadata':
            'f:labels':
              'f:app.kubernetes.io/instance': {}
              'f:app.kubernetes.io/name': {}
            'f:ownerReferences':
              'k:{"uid":"da23c696-5c71-405b-b9ac-c7d0ccb4dd43"}':
                .: {}
                'f:apiVersion': {}
                'f:blockOwnerDeletion': {}
                'f:controller': {}
                'f:kind': {}
                'f:name': {}
                'f:uid': {}
          'f:redirectURIs':
            'v:"https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.isf-racka.rtp.raleigh.ibm.com/auth/redirect"': {}
          'f:secret': {}
  secret: bk2_q1FqwhVhoNqOl7ob
  redirectURIs:
    - >-
      https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.isf-racka.rtp.raleigh.ibm.com/auth/redirect
  grantMethod: auto
  Note that the secret in the OAuthClient (bk2_q1FqwhVhoNqOl7ob) does not match the secret in oauth.properties (tjCK6kE0pmCe2d8b7azw).
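To confirm the mismatch, you can compare the two values with commands like the following ones; the GUI pod name and namespace are assumptions for this sketch, and you might need the -c option to select the GUI container:
  # Secret that is mounted in the GUI pod.
  oc exec ibm-spectrum-scale-gui-0 -n ibm-spectrum-scale -- grep '^secret=' /etc/gui-oauth/oauth.properties
  # Secret that is currently set on the OAuthClient.
  oc get oauthclient ibm-spectrum-scale-gui-oauthclient -o jsonpath='{.secret}{"\n"}'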
- Resolution
- As a resolution, do the following steps:
  - Restart the GUI pods.
  - If the previous step (step 1) does not resolve the issue, restart the IBM Storage Scale operator pod and then restart the GUI pods.
Download log files of all deployments
If you cannot download the log files of all deployments at once, download the log files individually.
Issues in node drain and node restart
- Resolution
- To resolve such node related issues, see Issues related to IBM Fusion HCI System node drains.
IBM Storage Scale cluster failed to recover after powering off and on the compute nodes
- Diagnosis
- Identify whether the nodes joining the cluster are in an out-of-tolerance situation.
- mmhealth reported an UNKNOWN RG status on some nodes (see the health check sketch after this entry).
- mmrepquota commands were generated repeatedly, likely by the GUI, which caused many long waits.
- Due to the large number of long waits, the cluster manager node was not responding to any ts command other than tsctl nqStatus.
- Resolution
- Follow the steps that are defined in Recover storage cluster from an unplanned multi-node restart or failure.
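To check whether the recovery groups and nodes report healthy again, a command like the following one can be used; the core pod name is a placeholder, and the ibm-spectrum-scale namespace is an assumption:
  # Show the aggregated cluster health, including recovery group state, from inside a core pod.
  oc exec <core-pod-name> -n ibm-spectrum-scale -- mmhealth cluster show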
MCO rollout is not progressing for more than 5 hours during the scale upgrade
- Cause
- Whenever the nodes are overcommitted and no pod capacity is available, you might see delays in policy, ICSP, pull secret, and MCO rollout changes.
- Resolution
- As a resolution, do the following step:
  - IBM Fusion HCI System nodes allow the maximum number of pods per node that is defined by the OpenShift® Container Platform default. Ensure that the pod count on each node stays within that limit; otherwise, you might see this issue. See the sketch that follows for one way to check the count.
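To check how close a node is to the default pod limit, commands like the following ones can help; the node name is a placeholder:
  # Show the node's pod capacity.
  oc get node <node-name> -o jsonpath='{.status.capacity.pods}{"\n"}'
  # Count the pods that are currently scheduled on that node.
  oc get pods --all-namespaces --field-selector spec.nodeName=<node-name> --no-headers | wc -l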
IBM Storage Scale cluster status is in a critical state after power on/off the rack
- Cause
- IBM Storage Scale cluster status is in a critical state because an unmounted_fs_check event is present on the Filesystem CR.
- Resolution
- As a resolution, wait until this Filesystem CR status gets cleared; after that, the IBM Storage Scale cluster status automatically returns to a healthy state. If the issue persists for a long time, contact IBM Support.
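To check whether the unmounted_fs_check event is still present, you can inspect the Filesystem CR with a command like the following one; the ibm-spectrum-scale namespace is an assumption:
  # Inspect the Filesystem CR status and events for unmounted_fs_check.
  oc get filesystem -n ibm-spectrum-scale -o yaml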
Disk upsize issues
- Problem statement
- After you add a new NVMe disk to the server, the new disk cannot be discovered in the IBM Storage Scale pod. The disks do not show up with /dev/nvme* device names in the pod, and IBM Storage Scale needs the /dev/nvme* device names to access the disks. Without the /dev/nvme* device names, the new disks cannot be added into the recovery group.
- Resolution
- Do the following manual steps to mount disks to /dev in the pod:
  - Log in to each IBM Storage Scale storage pod:
    oc exec -it <pod name> -c <container> /bin/sh
  - Stop the GPFS daemons on one or more nodes:
    mmshutdown
  - Mount /dev from the host by running the following command:
    mount -t devtmpfs none /dev
  - Validate that the content of /dev is from the host and that you can see subdirectories such as block, bus, and char (see the check after these steps).
  - Start the GPFS daemon again:
    mmstartup
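After you remount /dev, a quick check inside the pod can confirm that the new NVMe devices are visible; the exact device names depend on your hardware:
  # List NVMe block devices that are now visible inside the pod.
  ls -l /dev/nvme*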