Global Data Platform service issues

Use this troubleshooting information to resolve install and upgrade problems that are related to the Global Data Platform service.

Warning: Do NOT delete IBM Storage Scale pods. In many circumstances, deleting Scale pods affects availability and data integrity.

Cannot create or update the ScaleManager CR after an IBM Fusion upgrade

Symptoms
After an IBM Fusion upgrade, the Global Data Platform service does not show the Upgrade button on the IBM Fusion user interface because the status of the ScaleManager CR cannot be updated. You might see the following error in the isf-cns-operator pod logs.
Example:
2024-06-26T10:48:36.304945925Z 2024-06-26T10:48:36.304Z	ERROR	Reconciler error	{"controller": "scalemanager", "controllerGroup": "cns.isf.ibm.com", "controllerKind": "ScaleManager", "ScaleManager": {"name":"scalemanager","namespace":"ibm-spectrum-fusion-ns"}, "namespace": "ibm-spectrum-fusion-ns", "name": "scalemanager", "reconcileID": "96a5fb8a-06b3-4f20-858d-be7717d0edb1", "error": "Internal error occurred: failed calling webhook \"mscalemanager.fusion.spectrum.ibm.com\": failed to call webhook: Post \"https://isf-cns-operator-webhook-service.ibm-spectrum-fusion-ns.svc:443/mutate-cns-isf-ibm-com-v1-scalemanager?timeout=10s\": x509: certificate is valid for isf-cns-operator-controller-manager-service.ibm-spectrum-fusion-ns, isf-cns-operator-controller-manager-service.ibm-spectrum-fusion-ns.svc, not isf-cns-operator-webhook-service.ibm-spectrum-fusion-ns.svc"}
The ScaleManager CR cannot be created after an IBM Fusion upgrade, and the Global Data Platform installation gets stuck with the following error in the isf-prereq-operator pod logs:
2024-08-01T01:44:24Z	ERROR	controllers.prereq.SpectrumFusion	Failed to create ScaleManager CR:	{"fusionsdscluster": {"name":"spectrumfusion","namespace":"ibm-spectrum-fusion-ns"}, "ibm-spectrum-fusion-ns": " for Spectrum Scale storage", "error": "Internal error occurred: failed calling webhook \"vscalemanager.fusion.spectrum.ibm.com\": failed to call webhook: Post \"https://isf-cns-operator-webhook-service.ibm-spectrum-fusion-ns.svc:443/validate-cns-isf-ibm-com-v1-scalemanager?timeout=10s\": tls: failed to verify certificate: x509: certificate is valid for isf-cns-operator-controller-manager-service.ibm-spectrum-fusion-ns, isf-cns-operator-controller-manager-service.ibm-spectrum-fusion-ns.svc, not isf-cns-operator-webhook-service.ibm-spectrum-fusion-ns.svc"}
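To check a saved or streamed log for this symptom, the following is a minimal sketch. The function name has_webhook_cert_error is hypothetical, and the pattern is derived only from the two example messages above, so it might need adjusting for other message variants. The function reads log text from standard input.

```shell
# Hypothetical helper: exits 0 when the log text on stdin contains the
# webhook certificate mismatch shown in the example messages.
has_webhook_cert_error() {
  grep -Eq 'failed calling webhook.*x509: certificate is valid for .*isf-cns-operator-controller-manager-service.*, not isf-cns-operator-webhook-service'
}

# Usage (the pod name is a placeholder; substitute the actual operator pod):
#   oc logs <isf-cns-operator-pod> -n ibm-spectrum-fusion-ns | has_webhook_cert_error \
#     && echo "webhook certificate mismatch found"
```

The same function can be applied to the isf-prereq-operator pod logs, because both messages share the certificate text that the pattern matches.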
Cause
The webhooks for IBM Fusion HCI System were applied to IBM Fusion software. As the error messages show, the webhooks then call a service name that the certificate does not cover.
Resolution
  1. In the OpenShift® Container Platform console, go to Home > Search.
  2. From the Resources list, select MutatingWebhookConfiguration.
  3. Select the Label drop-down list and change it to Name.
  4. Search for isf-cns-operator. Check whether an instance of the isf-cns-operator-mutating-webhook-configuration webhook exists. If it exists, back up the older one and delete it.
  5. Go back to Home > Search page.
  6. From the Resources list, select ValidatingWebhookConfiguration.
  7. Search for isf-cns-operator and check whether an instance of the isf-cns-operator-validating-webhook-configuration webhook exists. If it exists, back up the older one and delete it.
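If you prefer the command line to the console, the backup and delete steps might be sketched as follows. The function name backup_and_delete_webhook and the backup file naming are assumptions for illustration. The sketch uses only oc get and oc delete, and it deletes the configuration only after the backup file is written.

```shell
# Hypothetical helper: back up a webhook configuration to YAML, then delete it.
# $1 = resource kind, $2 = resource name, $3 = backup directory (default: current directory)
backup_and_delete_webhook() {
  kind="$1"
  name="$2"
  backup_dir="${3:-.}"

  # Nothing to do if the configuration does not exist.
  if ! oc get "$kind" "$name" >/dev/null 2>&1; then
    echo "$kind/$name not found, nothing to delete"
    return 0
  fi

  # Delete only if the backup was written successfully.
  oc get "$kind" "$name" -o yaml > "$backup_dir/$name.yaml" || return 1
  oc delete "$kind" "$name"
}

# Usage:
#   backup_and_delete_webhook mutatingwebhookconfiguration isf-cns-operator-mutating-webhook-configuration
#   backup_and_delete_webhook validatingwebhookconfiguration isf-cns-operator-validating-webhook-configuration
```

Review the backup file before you rely on it, and confirm that the named configuration is the older instance that the steps refer to.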

Upgrade does not complete as node pod is in ContainerStatusUnknown state

Problem statement
The Global Data Platform upgrade does not complete because a compute node pod is in the ContainerStatusUnknown state.
Workaround
Delete the pod in ContainerStatusUnknown state and restart the compute node.
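To find the affected pod, a small filter over the default oc get pods output might help. This is a sketch under the assumption that the output uses the usual NAME, READY, STATUS, RESTARTS, AGE columns. The function name pods_with_status is hypothetical.

```shell
# Hypothetical helper: print the names of pods whose STATUS column equals $1.
# Reads the default tabular output of `oc get pods` from stdin
# (columns: NAME READY STATUS RESTARTS AGE).
pods_with_status() {
  awk -v want="$1" 'NR > 1 && $3 == want { print $1 }'
}

# Usage:
#   oc get pods -n ibm-spectrum-scale | pods_with_status ContainerStatusUnknown
```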

Pod drain issues in Global Data Platform upgrade

Diagnosis and resolution
If the upgrade of Global Data Platform gets stuck, do the following steps to diagnose and resolve the issue:
  1. Go through the logs from the scale operator.
  2. Check whether you observe the following error:

    ERROR Drain error when evicting pods

  3. If it is a drain error, then check whether the virtual machine's PVCs were created with the ReadWriteOnce access mode.

    PVCs created with ReadWriteOnce cannot be shared or moved, which can block the draining of the node.

    For example:
    oc get pods

    NAME                              READY       STATUS                   RESTARTS      AGE
    compute-0                         2/2         Running                  0             20h
    compute-1                         2/2         Running                  0             24h
    compute-13                        2/2         Running                  0             21h
    compute-14                        2/2         Running                  3             21h
    compute-2                         2/2         Running                  0             21h
    compute-3                         2/2         Running                  0             21h
    control-0                         2/2         Running                  0             23h
    control-1                         2/2         Running                  0             23h
    control-1-1                       2/2         Running                  0             23h
    control1-1-reboot                 0/1         ContainerStatusUnknown   1             23h
    control-2                         2/2         Running                  0             23h
    ibm-spectrum-scale-gui-0          4/4         Running                  0             20h
    ibm-spectrum-scale-gui-1          4/4         Running                  0             23h
    ibm-spectrum-scale-pmcollector-0  2/2         Running                  0             23h
    ibm-spectrum-scale-pmcollector-1  2/2         Running                  0             20h
    
  4. If the virtual machine's PVCs were created with ReadWriteOnce, then stop the virtual machine so that the upgrade of the Scale pods can continue.
  5. If the upgrade still fails, check whether an orphaned control-1-reboot container is left in the ibm-spectrum-scale namespace. If so, delete the container to resume and complete the installation.
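The PVC check in step 3 can be scripted. The following is a minimal sketch: the function name rwo_pvcs is hypothetical, and it assumes that the input comes from the jsonpath template shown in the usage comment, which prints one tab-separated line per PVC. The match is exact on ReadWriteOnce, so a PVC that only has ReadWriteOncePod is not reported.

```shell
# Hypothetical helper: print "namespace/name" for every PVC whose access modes
# include ReadWriteOnce. Expects tab-separated lines on stdin:
#   <namespace> TAB <name> TAB <space-separated access modes>
rwo_pvcs() {
  awk -F '\t' '$3 ~ /(^| )ReadWriteOnce( |$)/ { print $1 "/" $2 }'
}

# Usage: the jsonpath template emits one line per PVC in the format above.
#   oc get pvc -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.accessModes[*]}{"\n"}{end}' | rwo_pvcs
```

The output lists every ReadWriteOnce PVC in the cluster, so you still need to identify which of them belong to virtual machines.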

IBM Storage Scale might get stuck during upgrade

Resolution
Do the following steps:
  1. Shut down all applications that use storage.
  2. Scale down the IBM Storage Scale operator by setting the replica count of the ibm-spectrum-scale-controller-manager deployment to 0.
    1. Switch to the ibm-spectrum-scale-operator project:
      oc project ibm-spectrum-scale-operator
    2. Scale down the IBM Storage Scale operator.
      oc scale --replicas=0 deployment ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator
    3. Run the following command to confirm that no resources are found in the ibm-spectrum-scale-operator namespace:
      oc get po
  3. Log in to one of the IBM Storage Scale core pods and shut down the server.
    1. Log in to the server:
      oc rsh compute-0
    2. Shut down the server:
      mmshutdown -a
      Example output:
      Thu May 19 07:46:47 UTC 2022: mmshutdown: Starting force unmount of GPFS file systems
      Thu May 19 07:46:52 UTC 2022: mmshutdown: Shutting down GPFS daemons
      compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      compute-0.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      control-1.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      control-0.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      compute-1.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      compute-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      control-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Shutting down!
      compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  'shutdown' command about to kill process 1364
      compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading modules from /lib/modules/4.18.0-305.19.1.el8_4.x86_64/extra
      compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  'shutdown' command about to kill process 1276
      compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading modules from /lib/modules/4.18.0-305.19.1.el8_4.x86_64/extra
      compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  mmfsenv: Module mmfslinux is still in use.
      compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  mmfsenv: Module mmfslinux is still in use.
      compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfs26
      compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfs26
      compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unmount all GPFS file syste
      ..
      compute-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfs26
      control-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfs26
      control-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfslinux
      compute-0.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfslinux
      compute-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local:  Unloading module mmfslinux
      Thu May 19 07:47:03 UTC 2022: mmshutdown: Finished
  4. Check whether the states are down:
    mmgetstate -a
    Sample output:
    Node number  Node name          GPFS state  
    ---------------------------------------------
               1  control-0-daemon   down
               2  control-1-daemon   down
               3  control-2-daemon   down
               4  compute-14-daemon  down
               5  compute-13-daemon  down
               6  compute-0-daemon   down
               7  compute-1-daemon   down
               8  compute-2-daemon   down
    
  5. Delete all the pods in ibm-spectrum-scale namespace:
    oc delete pods --all -n ibm-spectrum-scale
  6. To verify, get the list of pods:
    oc get po -n ibm-spectrum-scale
    Sample output:
    NAME                              READY  STATUS   RESTARTS  AGE
    ibm-spectrum-scale-gui-0          0/4    Pending  0         7s
    ibm-spectrum-scale-pmcollector-0  0/2    Pending  0         7s
    Note: Only the IBM Storage Scale core pods remain deleted. The GUI and pmcollector pods are listed again, as the sample output shows.
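Before you delete the pods in step 5, the state check from step 4 can be automated. The following is a sketch only: the function name all_nodes_down is hypothetical, and it assumes the mmgetstate -a layout from the sample output, where each node row starts with a node number and the third column is the GPFS state.

```shell
# Hypothetical helper: succeeds only when every node row in `mmgetstate -a`
# output (read from stdin) reports the state "down". Prints nodes that are not down.
all_nodes_down() {
  awk '
    $1 ~ /^[0-9]+$/ {                 # node rows start with a node number
      rows++
      if ($3 != "down") { print $2 " is " $3; bad++ }
    }
    END { if (rows == 0 || bad > 0) exit 1 }   # fail on empty input or any non-down node
  '
}

# Usage (assumes the core pod runs in the ibm-spectrum-scale namespace):
#   oc rsh -n ibm-spectrum-scale compute-0 mmgetstate -a | all_nodes_down \
#     && echo "all GPFS daemons are down"
```

The function fails on empty input so that a failed oc rsh call is not mistaken for a cluster that is fully shut down.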

CSI Pods experience scheduling problems post node drain

Problem statement
When a compute node is drained, the CSI pods or sidecar pods on that node are evicted. The remaining compute nodes cannot host the evicted pods because of resource constraints at that point in time.
Resolution

Ensure that the system is functional and that the available compute nodes have sufficient resources to accommodate the evicted CSI pods or sidecar pods.

Global Data Platform upgrade progress

Problem statement
During a Global Data Platform upgrade, the progress percentage can exceed 100% in a few cases. You can ignore this issue because the percentage returns to 100% after the upgrade completes successfully.

Known issues in upgrade

  • During the upgrade, the IBM Fusion rack user interface, IBM Storage Scale user interface, Grafana endpoint, and applications are not reachable for some time.
  • During the upgrade, an intermittent error -1 node updated successfully out of 5 nodes appears on the upgrade details page of the IBM Fusion user interface.
  • During the upgrade, an intermittent error on the IBM Fusion user interface shows that the progress percentage for the Global Data Platform decreased. This occurs especially when the upgrade of one node completes.
  • During the upgrade, an intermittent error Global Data Platform upgrade failed appears on the upgrade details page of the IBM Fusion user interface. Ignore the error because it might recover during the next reconciliation.