Global Data Platform service issues

This section lists troubleshooting tips for issues that you might encounter when you use IBM Fusion storage.

Warning: Do NOT delete IBM Storage Scale pods. In many circumstances, deleting Scale pods affects availability and data integrity.

Filesystem status from Remote file systems is stuck in Connecting state

Cause
Secure boot is not supported by IBM Storage Scale and CNSA.
Resolution
  1. Check whether the nodes are running with secure boot:
    mokutil --sb-state
    SecureBoot enabled
  2. As the root user, disable secure boot with the following command and supply a password (8 to 16 characters) when prompted; the change takes effect after the node is rebooted and the change is confirmed in the MOK manager:
    mokutil --disable-validation
    password length: 8~16
    input password: 
    input password again: 

Increased CPU usage on Scale pods after upgrading to Cloud Pak for Integration (CP4I)

Problem statement
Scale pod CPU usage increases because of changes in the MQ streams application, specifically the IBM MQ liveness and readiness probes.
Resolution
Turn off the file handle reservation by setting the environment variable AMQ_NO_RESERVE_FD=1 for the MQ workload. This setting resolves the high CPU usage.
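For example, if the affected MQ workload runs as a standard Kubernetes Deployment, you can set the variable with a command like the following; the deployment and namespace names are placeholders. If the queue manager is managed by a custom resource, set the variable through that resource's pod template instead.
    oc set env deployment/<mq deployment name> AMQ_NO_RESERVE_FD=1 -n <mq namespace>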

Scale operator failed to add quorum label

Problem statement
During Storage configuration, the recovery group status can be "Waiting on some daemons in the recovery group to be restarted".
Resolution
  1. Run the following command to check the quorum node count:
    oc get nodes -l scale.spectrum.ibm.com/designation=quorum
  2. Check the number of quorum nodes. Whenever the node count is less than 10, there must be 3 quorum nodes. If there are not, apply the following quorum label to the missing nodes (see the example command after this list):
    scale.spectrum.ibm.com/designation=quorum
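    For example, you can apply the label to a node with the following command; the node name is a placeholder:
    oc label node <node name> scale.spectrum.ibm.com/designation=quorum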

Remote filesystem connection goes to a Disconnected state

Problem statement
If even one Scale core pod is not ready due to a node restart or other issues, then a problem occurs on the filesystem mount of that node and the filesystem CR reports errors. As a result, the filesystem goes to a Disconnected state, and the same state is displayed on the IBM Fusion user interface.
Resolution
Wait for the Scale core pod to reach the Ready state; the filesystem then automatically changes to the Connected state.
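To watch the Scale core pods until they are Ready, you can run a command like the following; the namespace shown is the typical default and might differ in your environment:
    oc get pods -n ibm-spectrum-scale -w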

PVCs stuck in pending state

Resolution
If PVCs are stuck in a pending state for a long time, restart Scale GUI pods.
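For example, you can restart the GUI pods by deleting them so that they are re-created; the namespace shown is the typical default, and the label selector matches the app.kubernetes.io/name: gui label shown later in this section:
    oc delete pod -l app.kubernetes.io/name=gui -n ibm-spectrum-scale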

Nodes with pods scheduled for storage result in CrashLoopBackOff or container creation errors

Problem statement
When install, upgrade, or upsize operations are in progress and the storage configuration is not yet done, nodes with pods scheduled for storage result in CrashLoopBackOff or container creation errors.
Resolution
To resolve this issue, disable scheduling on these nodes before you begin these operations, and schedule the pods on nodes that are configured for storage.
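For example, you can disable scheduling on a node before the operation and re-enable it afterward; the node name is a placeholder:
    oc adm cordon <node name>
    oc adm uncordon <node name>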

File system is not mounted on some nodes

Problem statement
After IBM Storage Scale upgrade, the file system is not mounted on some nodes.
Resolution
Do the following workaround steps:
  1. Delete the problematic pod where the file system is not mounted. Delete one pod at a time.
  2. Refresh the pod overview page until the pod appears. Run watch oc get pods.
  3. When the problematic pod comes up, set the replicas to 0 in the Spec section of the operator deployment in the ibm-spectrum-scale-operator namespace. You must stop the operator deployment immediately, and ensure that no operator pod is running in the ibm-spectrum-scale-operator namespace. Run the following command to set the operator replicas to 0:
    oc patch deploy ibm-spectrum-scale-controller-manager \
     --type='json' -n ibm-spectrum-scale-operator \
     -p='[{"op": "replace", "path": "/spec/replicas", "value": 0}]'
    
    oc get deployment -n ibm-spectrum-scale-operator
    Sample output:
    
    I0420 16:56:23.579340    4184 request.go:645] Throttling request took 1.192496025s, request: GET:https://api.isf-rackk.rtp.raleigh.ibm.com:6443/apis/nmstate.io/v1beta1?timeout=32s
    NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
    ibm-spectrum-scale-controller-manager   0/0     0            0           28h
  4. Wait until the problematic pod is in the init container state, and then run the following command:
    oc exec <pod name> -- mmsdrrestore -p <IP or name of any working core pod where the file system is mounted>
  5. After the previous step is successful, set the replicas to 1 again in the ibm-spectrum-scale-operator deployment.

    Ensure that the operator pod is running in ibm-spectrum-scale-operator namespace.

    Run the following command for setting the operator replica to 1:
    oc patch deploy ibm-spectrum-scale-controller-manager \
     --type='json' -n ibm-spectrum-scale-operator \
     -p='[{"op": "replace", "path": "/spec/replicas", "value": 1}]'
    oc get deployment -n ibm-spectrum-scale-operator
    Sample output:
    I0420 16:56:23.579340    4184 request.go:645] Throttling request took 1.192496025s, request: GET:https://api.isf-rackk.rtp.raleigh.ibm.com:6443/apis/nmstate.io/v1beta1?timeout=32s
    NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
    ibm-spectrum-scale-controller-manager   1/1     0            0           28h
  6. Wait for some time until the problematic pod is in the Running state and the file system is mounted on this node. See the verification example after this list.
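    To confirm that the file system is mounted again, you can run a check like the following from a core pod; the pod name and namespace are placeholders:
    oc exec <core pod name> -n ibm-spectrum-scale -- mmlsmount all -L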

IBM Storage Scale user interface does not load properly or IBM Storage Scale CSI pods are in CrashLoopBackOff state

Problem statement
Sometimes, the IBM Storage Scale user interface might not load properly or the IBM Storage Scale CSI pods might be in a CrashLoopBackOff state.
Resolution
As a workaround, do the following steps:
  1. Restart the GUI pods and check whether the user interface works properly.
  2. If restarting the GUI pods does not open the IBM Storage Scale console, then restart the ibm-spectrum-scale-controller-manager pod and then restart the GUI pods (see the example commands after this list).
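    For example, you can restart the operator pod with a rollout restart and then delete the GUI pods so that they are re-created; the namespaces are the defaults used elsewhere in this section and might differ in your environment:
    oc rollout restart deployment/ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator
    oc delete pod -l app.kubernetes.io/name=gui -n ibm-spectrum-scale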

Not able to log in to IBM Storage Scale user interface

Cause
The secret that is mounted in the GUI pods (the /etc/gui-oauth/oauth.properties file) and the secret in the ibm-spectrum-scale-gui-oauthclient OAuthClient are different.
gui-1
sh-4.4$ cd /etc/gui-oauth/
sh-4.4$ cat oauth.properties
secret=tjCK6kE0pmCe2d8b7azw
oauthServer=https://oauth-openshift.apps.isf-racka.rtp.raleigh.ibm.com
redirectURI=https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.isf-racka.rtp.raleigh.ibm.com/auth/redirect
clientID=ibm-spectrum-scale-gui-oauthclient
k8sAPIServer=https://172.30.0.1:443
sh-4.4$
gui-0
secret=tjCK6kE0pmCe2d8b7azw
oauthServer=https://oauth-openshift.apps.isf-racka.rtp.raleigh.ibm.com
redirectURI=https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.isf-racka.rtp.raleigh.ibm.com/auth/redirect
clientID=ibm-spectrum-scale-gui-oauthclient
k8sAPIServer=https://172.30.0.1:443
OAuth client config (API Explorer -> OAuthClient -> ibm-spectrum-scale-gui-oauthclient)
kind: OAuthClient
apiVersion: oauth.openshift.io/v1
metadata:
  name: ibm-spectrum-scale-gui-oauthclient
  uid: a9eb9d85-33b9-4cd7-99c8-ae0213032380
  resourceVersion: '314932'
  creationTimestamp: '2022-03-29T17:09:38Z'
  labels:
    app.kubernetes.io/instance: ibm-spectrum-scale
    app.kubernetes.io/name: gui
  ownerReferences:
    - apiVersion: scale.spectrum.ibm.com/v1beta1
      kind: Gui
      name: ibm-spectrum-scale-gui
      uid: da23c696-5c71-405b-b9ac-c7d0ccb4dd43
      controller: true
      blockOwnerDeletion: true
  managedFields:
    - manager: ibm-spectrum-scale/ibm-spectrum-scale-gui
      operation: Apply
      apiVersion: oauth.openshift.io/v1
      time: '2022-03-29T17:09:38Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:grantMethod': {}
        'f:metadata':
          'f:labels':
            'f:app.kubernetes.io/instance': {}
            'f:app.kubernetes.io/name': {}
          'f:ownerReferences':
            'k:{"uid":"da23c696-5c71-405b-b9ac-c7d0ccb4dd43"}':
              .: {}
              'f:apiVersion': {}
              'f:blockOwnerDeletion': {}
              'f:controller': {}
              'f:kind': {}
              'f:name': {}
              'f:uid': {}
        'f:redirectURIs':
          'v:"https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.isf-racka.rtp.raleigh.ibm.com/auth/redirect"': {}
        'f:secret': {}
secret: bk2_q1FqwhVhoNqOl7ob
redirectURIs:
  - >-
    https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.isf-racka.rtp.raleigh.ibm.com/auth/redirect
grantMethod: auto
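To compare the two secrets directly, you can run commands like the following; the GUI pod name and namespace are placeholders:
    oc exec <gui pod name> -n ibm-spectrum-scale -- cat /etc/gui-oauth/oauth.properties
    oc get oauthclient ibm-spectrum-scale-gui-oauthclient -o jsonpath='{.secret}'
If the secret value in oauth.properties does not match the secret in the OAuthClient, the login fails.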

Resolution
As a resolution, do the following steps:
  1. Restart the GUI pods.
  2. If step 1 does not resolve the issue, restart the IBM Storage Scale operator pod and then restart the GUI pods.

Download log files of all deployments

If you cannot download the log files of all deployments at once, download the log files of each deployment individually.
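If the combined download fails, you can also save the logs of a single deployment to a file with a command like the following; the deployment name and namespace are placeholders:
    oc logs deployment/<deployment name> -n <namespace> > <deployment name>.log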

Issues in node drain and node restart

Resolution
To resolve such node related issues, see Issues related to IBM Fusion HCI System node drains.

IBM Storage Scale cluster failed to recover after powering off and on the compute nodes

Diagnosis
Identify whether the nodes joining the cluster are in an out-of-tolerance situation.
  1. mmhealth reported an UNKNOWN RG status on some nodes.
  2. mmrepquota commands were generated repeatedly, likely by the GUI, which caused many long waits.
  3. Due to the large number of long waits, the cluster manager node was not responding to any ts command other than tsctl nqStatus.

    As a resolution, follow the steps in Recover storage cluster from an unplanned multi-node restart or failure.

MCO rollout is not progressing for more than 5 hours during the Scale upgrade

Cause
Whenever the nodes are overcommitted and the per-node pod limit is reached, you might see delays in the policy rollout, ICSP, pull secret, and MCO rollout changes.
Resolution
As a resolution, do the following step:
  • IBM Fusion HCI System nodes allow the maximum number of pods per node that is defined by the OpenShift® Container Platform default. Ensure that the pod count on each node stays within that limit; otherwise, you might see this issue. See the example check after this list.
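    To check how many pods are currently scheduled on a node, you can run a command like the following; the node name is a placeholder:
    oc get pods --all-namespaces --field-selector spec.nodeName=<node name> --no-headers | wc -l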

IBM Storage Scale cluster status is in a critical state after the rack is powered off and on

Cause
The IBM Storage Scale cluster status is in a critical state because the unmounted_fs_check event is present on the filesystem CR.
Resolution
As a resolution, wait until this filesystem CR status gets cleared; after that, the IBM Storage Scale cluster status automatically returns to a healthy state. If the issue persists for a long time, then contact IBM support.
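To check whether the unmounted_fs_check event is still reported, you can inspect the filesystem CR status; the CR name and namespace are typical defaults and might differ in your environment:
    oc get filesystems.scale.spectrum.ibm.com <filesystem name> -n ibm-spectrum-scale -o yaml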

Disk upsize issues

  • Manual steps to test the /dev mount:

    After you add a new NVMe disk to the server, the new disk cannot be discovered in the IBM Storage Scale pod. The disks do not show up with /dev/nvme* device names in the pod, and IBM Storage Scale needs the /dev/nvme* device names to access the disks. Without the /dev/nvme* device names, the new disks cannot be added into the recovery group.

    Resolution
    Do the following manual steps to mount disks to /dev in the pod.
    1. Log in to each IBM Storage Scale storage pod:
      oc exec -it <pod name> -c <container> -- /bin/sh
    2. Stop the GPFS daemons on one or more nodes:
      mmshutdown
    3. Mount /dev from the host by running the following command:
      mount -t devtmpfs none /dev
    4. Validate that the content of /dev is from the host and that you can see subdirectories such as block, bus, and char.
    5. Start GPFS again (see the verification example after these steps):
      mmstartup
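    After GPFS is started again, you can confirm from inside the pod that the new NVMe devices are visible; the device names vary by system:
      ls -l /dev/nvme*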