Global Data Platform service install and upgrade issues

Use this troubleshooting information to resolve install and upgrade problems that are related to the Global Data Platform service.

Warning: Do NOT delete IBM Storage Scale pods. Deleting Scale pods can, in many circumstances, affect availability and data integrity.

IBM Storage Scale stuck with error

Problem statement
The IBM Storage Scale installation is stuck with the following error:
ERROR Failed to define vdiskset {"controller": "filesystem", "controllerGroup": "scale.spectrum.ibm.com", "controllerKind": "Filesystem", "Filesystem": {"name":"ibmspectrum-fs","namespace":"ibm-spectrum-scale"}, "namespace": "ibm-spectrum-scale", "name": "ibmspectrum-fs", "reconcileID": "5fcf21a4-b488-44db-8db5-0f3c1ab695cd", "cmd": "/usr/lpp/mmfs/bin/mmvdisk vs define --vs ibmspectrum-fs-0 --rg rg1 --code 4+2P --bs 4M -
Resolution
To resolve the issue, run the following command to manually create a recovery group:
mmvdisk recoverygroup create --recovery-group <RG name> --node-class <Node class name>
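
For example, a minimal sketch that is run from inside one of the IBM Storage Scale core pods; the recovery group name rg1 is taken from the error message above, and the node class name ece-nodeclass is a hypothetical placeholder:
  # Create the recovery group manually
  mmvdisk recoverygroup create --recovery-group rg1 --node-class ece-nodeclass
  # Verify that the recovery group was created
  mmvdisk recoverygroup list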

Upgrade does not complete as node pod is in ContainerStatusUnknown state

Problem statement
Global Data Platform upgrade does not complete because a compute node pod is in the ContainerStatusUnknown state.
Workaround
Delete the pod that is in the ContainerStatusUnknown state and restart the compute node.
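
For example, a sketch with placeholder names; rebooting the node through a debug pod is one option, assuming cluster-admin access:
  # Find and delete the pod that reports ContainerStatusUnknown
  oc get pods -n ibm-spectrum-scale
  oc delete pod <pod_in_ContainerStatusUnknown> -n ibm-spectrum-scale
  # Restart the compute node
  oc debug node/<compute_node> -T -- chroot /host systemctl reboot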

Pod drain issues in Global Data Platform upgrade

Diagnosis and resolution
If the upgrade of Global Data Platform gets stuck, diagnose and resolve the issue as follows:
  1. Go through the logs from the Scale operator.
  2. Check whether you observe the following error:

    ERROR Drain error when evicting pods

  3. If it is a drain error, then check whether the virtual machine's PVCs were created with the ReadWriteOnce access mode.

    PVCs that are created with ReadWriteOnce are not shareable or movable, and can block the draining of the node.

    For example:
    oc get pods

    NAME                              READY  STATUS                   RESTARTS  AGE
    compute-0                         2/2    Running                  0         20h
    compute-1                         2/2    Running                  0         24h
    compute-13                        2/2    Running                  0         21h
    compute-14                        2/2    Running                  3         21h
    compute-2                         2/2    Running                  0         21h
    compute-3                         2/2    Running                  0         21h
    control-0                         2/2    Running                  0         23h
    control-1                         2/2    Running                  0         23h
    control-1-1                       2/2    Running                  0         23h
    control1-1-reboot                 0/1    ContainerStatusUnknown   1         23h
    control-2                         2/2    Running                  0         23h
    ibm-spectrum-scale-gui-0          4/4    Running                  0         20h
    ibm-spectrum-scale-gui-1          4/4    Running                  0         23h
    ibm-spectrum-scale-pmcollector-0  2/2    Running                  0         23h
    ibm-spectrum-scale-pmcollector-1  2/2    Running                  0         20h

  4. If the virtual machine's PVCs were created with ReadWriteOnce, stop the virtual machine so that the upgrade of the Scale pods can continue.
  5. If the upgrade fails further, check whether the cause is an orphaned control-1-reboot container left in the ibm-spectrum-scale namespace. Delete the container to resume and complete the upgrade (see the command sketch after this procedure).
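
A minimal command sketch for steps 3 and 5; the pod name control1-1-reboot is taken from the sample output above, and the custom-columns fields are standard Kubernetes PVC fields:
  # List PVCs with their access modes to spot ReadWriteOnce volumes
  oc get pvc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,ACCESSMODES:.spec.accessModes
  # Delete the orphaned reboot pod so that the upgrade can resume
  oc delete pod control1-1-reboot -n ibm-spectrum-scale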

IBM Storage Scale might get stuck during upgrade

Resolution
Do the following steps (a consolidated command sketch follows this procedure):
  1. Shut down all applications that use storage.
  2. Scale down the IBM Storage Scale operator by setting the replicas to 0 for the deployment ibm-spectrum-scale-controller-manager.
    1. Switch to the ibm-spectrum-scale-operator project:
      oc project ibm-spectrum-scale-operator
    2. Scale down the IBM Storage Scale operator.
      oc scale --replicas=0 deployment ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator
    3. Run the following command to confirm that no resources are found in the ibm-spectrum-scale-operator namespace:
      oc get po
  3. Log in to one of the IBM Storage Scale core pods and shut down GPFS.
    1. Log in to the pod:
      oc rsh -n ibm-spectrum-scale compute-0
    2. Shut down GPFS on all nodes:
      mmshutdown -a
      Example output:
      Thu May 19 07:46:47 UTC 2022: mmshutdown: Starting force unmount of GPFS file systems
      Thu May 19 07:46:52 UTC 2022: mmshutdown: Shutting down GPFS daemons
      compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      compute-0.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      control-1.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      control-0.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      compute-1.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      compute-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      control-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  'shutdown' command about to kill process 1364
      compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading modules from /lib/modules/4.18.0-305.19.1.el8_4.x86_64/extra
      compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  'shutdown' command about to kill process 1276
      compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading modules from /lib/modules/4.18.0-305.19.1.el8_4.x86_64/extra
      compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  mmfsenv: Module mmfslinux is still in use.
      compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  mmfsenv: Module mmfslinux is still in use.
      compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfs26
      compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfs26
      compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unmount all GPFS file syste
      ..
      compute-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfs26
      control-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfs26
      control-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfslinux
      compute-0.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfslinux
      compute-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfslinux
      Thu May 19 07:47:03 UTC 2022: mmshutdown: Finished
  4. Check whether the states are down:
    mmgetstate -a
    Sample output:
    Node number  Node name          GPFS state  
    ---------------------------------------------
               1  control-0-daemon   down
               2  control-1-daemon   down
               3  control-2-daemon   down
               4  compute-14-daemon  down
               5  compute-13-daemon  down
               6  compute-0-daemon   down
               7  compute-1-daemon   down
               8  compute-2-daemon   down
    
  5. Delete all the pods in ibm-spectrum-scale namespace:
    oc delete pods --all -n ibm-spectrum-scale
  6. To verify, get the list of pods:
    oc get po -n ibm-spectrum-scale
    Sample output:
    NAME                              READY  STATUS   RESTARTS  AGE
    ibm-spectrum-scale-gui-0          0/4    Pending  0         7s
    ibm-spectrum-scale-pmcollector-0  0/2    Pending  0         7s
    Note: Only the IBM Storage Scale core pods stay deleted; the GUI and pmcollector pods are re-created, as shown in the sample output.
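
A consolidated sketch of steps 2 through 5, assuming compute-0 is a reachable core pod (the mm commands use the full /usr/lpp/mmfs/bin path in case it is not on the PATH):
  # Scale down the operator so that it does not re-create pods
  oc scale --replicas=0 deployment ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator
  # Shut down GPFS on all nodes from one core pod
  oc rsh -n ibm-spectrum-scale compute-0 /usr/lpp/mmfs/bin/mmshutdown -a
  # Confirm that every node reports "down"
  oc rsh -n ibm-spectrum-scale compute-0 /usr/lpp/mmfs/bin/mmgetstate -a
  # Delete all pods in the ibm-spectrum-scale namespace
  oc delete pods --all -n ibm-spectrum-scale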

IBM Storage Scale upgrade gets stuck

Cause
The IBM Storage Scale upgrade gets stuck because of isf-storage-service. It happens whenever scaleoutStatus or scaleupStatus is In Progress in the Scale CR.
Important: Do not update the isf-storage-service to the latest image when an IBM Storage Scale operation is in progress.
Workaround
Steps to upgrade the IBM Storage Scale on the rack:
  1. Go to the isf-storage-service in the ibm-spectrum-fusion-ns project and change the image to cp.icr.io/cp/isf/isf-storage-services:2.2.0-latest, or run the following oc command:
    oc patch deploy isf-storage-service-dep -n ibm-spectrum-fusion-ns --patch='{"spec":{"template":{"spec":{"containers":[{"name": "app", "image":"cp.icr.io/cp/isf/isf-storage-services:2.2.0-latest"}]}}}}'
  2. After the isf-storage-service pod goes to the Running state and points to the cp.icr.io/cp/isf/isf-storage-services:2.2.0-latest image, run the following API endpoint command in the isf-storage-service pod's terminal:
    curl -k https://isf-scale-svc/api/v1/upgradeExcludingOperator
    You can observe the new logs for isf-storage-service as well.
  3. Set the edition to erasure-code in the Daemon CR:
    oc patch daemon ibm-spectrum-scale \
     --type='json' -n ibm-spectrum-scale \
     -p='[{"op": "replace", "path": "/spec/edition", "value": "erasure-code"}]'
  4. Run the following curl command to deploy the operator on the Red Hat® OpenShift® Container Platform cluster:
    curl -k https://isf-scale-svc/api/v1/upgradeWithOperator
    You can observe new logs for isf-storage-service.
  5. Wait for some time and check whether the IBM Storage Scale core pods are restarting or in the Running state in the ibm-spectrum-scale namespace.
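
To verify the changes from steps 1 and 3, checks like the following can help; the jsonpath expressions are assumptions based on the fields that are patched above:
  # Confirm that the deployment points to the 2.2.0-latest image
  oc get deploy isf-storage-service-dep -n ibm-spectrum-fusion-ns -o jsonpath='{.spec.template.spec.containers[0].image}'
  # Confirm that the Daemon CR edition was updated
  oc get daemon ibm-spectrum-scale -n ibm-spectrum-scale -o jsonpath='{.spec.edition}'
  # Watch the core pods until they reach the Running state
  oc get pods -n ibm-spectrum-scale -w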

CSI pods experience scheduling problems after node drain

Problem statement
Whenever a compute node is drained, the CSI pods or sidecar pods on that node are evicted. The remaining compute nodes might not be able to host the evicted CSI pods or sidecar pods because of resource constraints at that point in time.
Resolution
Ensure that the cluster has available compute nodes with sufficient resources to accommodate the evicted CSI pods or sidecar pods.
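
To see why an evicted CSI pod cannot be scheduled, commands like the following can help; the ibm-spectrum-scale-csi namespace is an assumption, so use the namespace where your CSI pods run:
  # Find CSI pods that are stuck in Pending
  oc get pods -n ibm-spectrum-scale-csi --field-selector=status.phase=Pending
  # The Events section shows which scheduling constraint was not met
  oc describe pod <pending_csi_pod> -n ibm-spectrum-scale-csi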

Global Data Platform upgrade progress

Problem statement
During the Global Data Platform upgrade, the progress percentage can go beyond 100% in a few cases. You can ignore this issue because the percentage comes back to 100% after the successful completion of the upgrade.

Scale installation might get stuck with ECE pods in init state

Problem statement
The Scale installation might get stuck with ECE pods in init state with the following error:
The node already belongs to a GPFS cluster.
Resolution
  1. Check whether the daemon-network IP of the pod is pingable.
  2. If it is not pingable, then restart all available TORs.
  3. After you restart the TORs, check whether the daemon-network IP is pingable.
  4. For the given worker node, run the following oc command to manually clean up the node:
    oc debug node/<openshift_worker_node> -T -- chroot /host sh -c "rm -rf /var/mmfs"
  5. Kill the pod and wait for it to come up again. The Scale installation resumes after all the ECE pods go to the Running state (see the sketch after this procedure).
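
A minimal sketch of steps 4 and 5 with placeholder names:
  # Clean up leftover GPFS state on the worker node
  oc debug node/<openshift_worker_node> -T -- chroot /host sh -c "rm -rf /var/mmfs"
  # Delete the stuck ECE pod so that it is re-created
  oc delete pod <stuck_ece_pod> -n ibm-spectrum-scale
  # Watch until all the ECE pods reach the Running state
  oc get pods -n ibm-spectrum-scale -w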

Expansion rack fails for 4+2p setup

Problem statement
Sometimes, in a high-availability cluster, the expansion rack fails for a 4+2p setup.
Resolution
  • Do the following workaround steps to resolve the issue:
    1. In the OpenShift Container Platform user interface, go to Administration > CustomResourceDefinitions > Scale > Instances > storagemanager > Yaml and check whether scaleoutStatus: IN-PROGRESS is set.
    2. If scaleoutStatus: IN-PROGRESS is set, go to Administration > CustomResourceDefinitions > RecoveryGroup > Instances and check for the created RecoveryGroups.
    3. If more than two recovery groups are created, do the following actions:
      1. Go to Administration > CustomResourceDefinitions > RecoveryGroup > Instances > ..
      2. Find RecoveryGroup instance with suffix rg2.
      3. Click the ellipsis menu for that instance and select the Delete RecoveryGroup option.
    4. Do the following checks to validate whether the node upsize completed successfully:
      • In the IBM Fusion user interface, go to the IBM Storage Scale user interface by using the app switcher.
      • Click Recovery groups on the IBM Spectrum Scale RAID tile.
      • Click the entry of Recovery group server nodes.
      • Check whether all the newly added nodes for the expansion rack are listed and are in the Healthy state.
      • Create a sample PVC to check whether the storage can be provisioned.
    5. Check for the labels on each recovery group.
      1. In the OpenShift Container Platform console, go to Administration > CustomResourceDefinitions > RecoveryGroup > Instances > Yaml.
      2. Check whether the YAML contains either of these two labels:
        • scale.spectrum.ibm.com/scale: up
        • scale.spectrum.ibm.com/scale: out
      3. If either label is present, remove it from each recovery group and click Save (see the command sketch after this procedure).
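
Alternatively, the label check and removal can be done from the command line; a sketch, assuming the RecoveryGroup resources are in the ibm-spectrum-scale namespace:
  # Show the labels on each recovery group
  oc get recoverygroups -n ibm-spectrum-scale --show-labels
  # Remove the scale label from a recovery group (the trailing "-" removes a label)
  oc label recoverygroup <rg_name> scale.spectrum.ibm.com/scale- -n ibm-spectrum-scale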

Known issues in installation

  • The display unit of the file system block size in storagesetupcr is MiB, and that of ibmspectrum-fs is M. Because IBM Storage Scale also uses the MiB format internally, you can ignore this inconsistency.

Known issues in upgrade

  • During the upgrade, the IBM Fusion rack user interface, IBM Storage Scale user interface, Grafana endpoint, and applications are not reachable for some time.
  • During the upgrade, an intermittent error -1 node updated successfully out of 5 nodes shows on the IBM Fusion user interface in the upgrade details page.
  • During the upgrade, an intermittent error shows on the IBM Fusion user interface that the progress percentage for the Global Data Platform decreased. It occurs especially when the upgrade for one node is completed.
  • During the upgrade, an intermittent error Global Data Platform upgrade failed shows on the IBM Fusion user interface in the upgrade details page. Ignore the error because it might recover during the next reconciliation.