Global Data Platform service issues
Use this troubleshooting information to resolve install and upgrade problems that are related to the Global Data Platform service.
Warning: Do not delete IBM Storage Scale pods. Deleting Scale pods can, in many circumstances, affect availability and data integrity.
Cannot create or update the ScaleManager CR after an IBM Fusion upgrade
- Symptoms
- After the IBM Fusion upgrade, the Global Data Platform service does not show the Upgrade button on the IBM Fusion user interface because the status of the ScaleManager CR cannot be updated. You might see the following error in the isf-cns-operator pod logs.
Example:
2024-06-26T10:48:36.304945925Z 2024-06-26T10:48:36.304Z ERROR Reconciler error {"controller": "scalemanager", "controllerGroup": "cns.isf.ibm.com", "controllerKind": "ScaleManager", "ScaleManager": {"name":"scalemanager","namespace":"ibm-spectrum-fusion-ns"}, "namespace": "ibm-spectrum-fusion-ns", "name": "scalemanager", "reconcileID": "96a5fb8a-06b3-4f20-858d-be7717d0edb1", "error": "Internal error occurred: failed calling webhook \"mscalemanager.fusion.spectrum.ibm.com\": failed to call webhook: Post \"https://isf-cns-operator-webhook-service.ibm-spectrum-fusion-ns.svc:443/mutate-cns-isf-ibm-com-v1-scalemanager?timeout=10s\": x509: certificate is valid for isf-cns-operator-controller-manager-service.ibm-spectrum-fusion-ns, isf-cns-operator-controller-manager-service.ibm-spectrum-fusion-ns.svc, not isf-cns-operator-webhook-service.ibm-spectrum-fusion-ns.svc"}
- The ScaleManager CR cannot be created after an IBM Fusion upgrade because the Global Data Platform installation gets stuck with the following error in the isf-prereq-operator pod logs:
2024-08-01T01:44:24Z ERROR controllers.prereq.SpectrumFusion Failed to create ScaleManager CR: {"fusionsdscluster": {"name":"spectrumfusion","namespace":"ibm-spectrum-fusion-ns"}, "ibm-spectrum-fusion-ns": " for Spectrum Scale storage", "error": "Internal error occurred: failed calling webhook \"vscalemanager.fusion.spectrum.ibm.com\": failed to call webhook: Post \"https://isf-cns-operator-webhook-service.ibm-spectrum-fusion-ns.svc:443/validate-cns-isf-ibm-com-v1-scalemanager?timeout=10s\": tls: failed to verify certificate: x509: certificate is valid for isf-cns-operator-controller-manager-service.ibm-spectrum-fusion-ns, isf-cns-operator-controller-manager-service.ibm-spectrum-fusion-ns.svc, not isf-cns-operator-webhook-service.ibm-spectrum-fusion-ns.svc"}
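To confirm the mismatch from the command line, you can inspect which service the webhook configuration calls. The following check is only a sketch; the webhook configuration name is taken from the error message above and might differ on your cluster.
# List the CNS webhook configurations on the cluster.
oc get mutatingwebhookconfiguration,validatingwebhookconfiguration | grep isf-cns-operator
# Show the service that each webhook calls. A service name that does not match the
# certificate (isf-cns-operator-controller-manager-service) reproduces the x509 error.
oc get mutatingwebhookconfiguration isf-cns-operator-mutating-webhook-configuration -o jsonpath='{range .webhooks[*]}{.name}{" -> "}{.clientConfig.service.name}{"\n"}{end}'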
- Cause
- The webhooks that are meant for IBM Fusion HCI System were incorrectly applied to IBM Fusion software.
- Resolution
- Do the following steps:
- In the OpenShift® Container Platform console, go to .
- From the Resources list, select MutatingWebhookConfiguration.
- Select the Label drop-down list and change it to Name.
- Search for isf-cns-operator. Check whether an instance of the isf-cns-operator-mutating-webhook-configuration webhook exists. If it exists, take a backup of the older one and delete it. A CLI alternative for the backup and deletion follows this procedure.
- Go back to the page.
- From the Resources list, select ValidatingWebhookConfiguration.
- Search for isf-cns-operator and check whether an instance of the isf-cns-operator-validating-webhook-configuration webhook exists. If it exists, take a backup of the older one and delete it.
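If you prefer the oc CLI over the console, the following sketch backs up and deletes the same webhook configurations. The resource names are the ones shown in the steps above; adjust them if they differ on your cluster, and keep the backup files in case you need to restore the configurations.
# Back up, then delete, the mutating webhook configuration.
oc get mutatingwebhookconfiguration isf-cns-operator-mutating-webhook-configuration -o yaml > isf-cns-mutating-webhook-backup.yaml
oc delete mutatingwebhookconfiguration isf-cns-operator-mutating-webhook-configuration
# Back up, then delete, the validating webhook configuration.
oc get validatingwebhookconfiguration isf-cns-operator-validating-webhook-configuration -o yaml > isf-cns-validating-webhook-backup.yaml
oc delete validatingwebhookconfiguration isf-cns-operator-validating-webhook-configuration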
Upgrade does not complete as node pod is in ContainerStatusUnknown state
- Problem statement
- Global Data Platform upgrade does not complete because a compute node pod is in ContainerStatusUnknown state.
- Workaround
- Delete the pod in ContainerStatusUnknown state and restart the compute node.
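The following commands are one way to locate and remove the affected pod. The namespace is an assumption based on the examples elsewhere in this topic; substitute the namespace and pod name that you see on your cluster.
# Find pods that report the ContainerStatusUnknown status.
oc get pods -n ibm-spectrum-scale | grep ContainerStatusUnknown
# Delete the affected pod, then restart the compute node that hosted it.
oc delete pod <pod-name> -n ibm-spectrum-scale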
Pod drain issues in Global Data Platform upgrade
- Diagnosis and resolution
- If the upgrade of Global Data Platform gets stuck, do the following steps to diagnose and resolve the issue:
- Go through the logs from the Scale operator.
- Check whether you observe the following error:
ERROR Drain error when evicting pods
- If it is a drain error, check whether the virtual machine's PVCs were created with the ReadWriteOnce access mode (a sample check follows this list).
PVCs that are created with ReadWriteOnce are not shareable or movable and can cause the draining of the node to fail.
For example:
oc get pods
NAME                               READY   STATUS                   RESTARTS   AGE
compute-0                          2/2     Running                  0          20h
compute-1                          2/2     Running                  0          24h
compute-13                         2/2     Running                  0          21h
compute-14                         2/2     Running                  3          21h
compute-2                          2/2     Running                  0          21h
compute-3                          2/2     Running                  0          21h
control-0                          2/2     Running                  0          23h
control-1                          2/2     Running                  0          23h
control-1-1                        2/2     Running                  0          23h
control1-1-reboot                  0/1     ContainerStatusUnknown   1          23h
control-2                          2/2     Running                  0          23h
ibm-spectrum-scale-gui-0           4/4     Running                  0          20h
ibm-spectrum-scale-gui-1           4/4     Running                  0          23h
ibm-spectrum-scale-pmcollector-0   2/2     Running                  0          23h
ibm-spectrum-scale-pmcollector-1   2/2     Running                  0          20h
- If the virtual machine's PVCs were created with ReadWriteOnce, stop the virtual machine to continue the upgrade of the Scale pods.
- If the upgrade fails further, check whether the cause is an orphaned control-1-reboot container that is left in the ibm-spectrum-scale namespace. Delete the container to resume and complete the upgrade.
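The following commands are a sketch for checking the access modes of the PVCs that back a virtual machine. The namespace placeholder is an assumption; use the namespace where the virtual machine runs. The default oc get pvc output already includes the ACCESS MODES column.
# List the PVCs and their access modes; RWO indicates ReadWriteOnce.
oc get pvc -n <vm-namespace>
# If a virtual machine uses ReadWriteOnce PVCs, stop it so that the node can drain.
# The virtctl client is part of OpenShift Virtualization; this assumes it is installed.
virtctl stop <vm-name> -n <vm-namespace>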
IBM Storage Scale might get stuck during upgrade
- Resolution
- Do the following steps:
- Shut down all applications that use storage.
- Scale down the IBM Storage Scale operator.
Set the replica count to 0 for the ibm-spectrum-scale-controller-manager deployment.
- Switch to the ibm-spectrum-scale-operator project:
oc project ibm-spectrum-scale-operator
- Scale down the IBM Storage Scale operator:
oc scale --replicas=0 deployment ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator
- Run the following command to confirm that no resources are found in the ibm-spectrum-scale-operator namespace:
oc get po
- Log in to one of the IBM Storage Scale core pods and shut down the server.
- Log in to the core pod:
oc rsh compute-0
- Shut down the server:
mmshutdown -a
Example output:
Thu May 19 07:46:47 UTC 2022: mmshutdown: Starting force unmount of GPFS file systems
Thu May 19 07:46:52 UTC 2022: mmshutdown: Shutting down GPFS daemons
compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
compute-0.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
control-1.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
control-0.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
compute-1.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
compute-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
control-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: 'shutdown' command about to kill process 1364
compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading modules from /lib/modules/4.18.0-305.19.1.el8_4.x86_64/extra
compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: 'shutdown' command about to kill process 1276
compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading modules from /lib/modules/4.18.0-305.19.1.el8_4.x86_64/extra
compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: mmfsenv: Module mmfslinux is still in use.
compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: mmfsenv: Module mmfslinux is still in use.
compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfs26
compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfs26
compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unmount all GPFS file syste
..
compute-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfs26
control-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfs26
control-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfslinux
compute-0.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfslinux
compute-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfslinux
Thu May 19 07:47:03 UTC 2022: mmshutdown: Finished
- Check whether the states are down:
mmgetstate -a
Sample output:
 Node number  Node name           GPFS state
---------------------------------------------
       1      control-0-daemon    down
       2      control-1-daemon    down
       3      control-2-daemon    down
       4      compute-14-daemon   down
       5      compute-13-daemon   down
       6      compute-0-daemon    down
       7      compute-1-daemon    down
       8      compute-2-daemon    down
- Delete all the pods in the ibm-spectrum-scale namespace:
oc delete pods --all -n ibm-spectrum-scale
- To verify, get the list of pods:
oc get po -n ibm-spectrum-scale
Sample output:
NAME                               READY   STATUS    RESTARTS   AGE
ibm-spectrum-scale-gui-0           0/4     Pending   0          7s
ibm-spectrum-scale-pmcollector-0   0/2     Pending   0          7s
Note: Only the IBM Storage Scale pods get deleted, and not the GUI and pmcollector pods.
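After the pods restart cleanly and the upgrade is unblocked, the operator that you scaled down earlier must be scaled back up so that it can resume reconciling the cluster. The replica count of 1 is an assumption; confirm the value that your deployment normally runs with before you apply it.
# Restore the IBM Storage Scale operator deployment (assumed replica count of 1).
oc scale --replicas=1 deployment ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator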
CSI pods experience scheduling problems after node drain
- Problem statement
- Whenever a compute node is drained, the CSI pods or sidecar pods are evicted. The remaining available compute nodes might not be able to host the evicted CSI pods or sidecar pods because of resource constraints at that point in time.
- Resolution
- Ensure that a functional system exists with available compute nodes that have sufficient resources to accommodate the evicted CSI pods or sidecar pods.
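To see why an evicted CSI or sidecar pod stays unscheduled, describe the pending pod and review the scheduler events. The namespace is an assumption (the CSI driver commonly runs in ibm-spectrum-scale-csi); adjust it to match your deployment.
# List pods that are not running and inspect the scheduling events of a pending one.
oc get pods -n ibm-spectrum-scale-csi | grep -v Running
oc describe pod <pending-pod-name> -n ibm-spectrum-scale-csi
# Check how much CPU and memory is already allocated on a candidate node.
oc describe node <node-name> | grep -A 8 "Allocated resources"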
Global Data Platform upgrade progress
- Problem statement
- During the Global Data Platform upgrade, the progress percentage can go beyond 100% in a few cases. You can ignore this issue because the percentage returns to 100% after the upgrade completes successfully.
Known issues in upgrade
- During the upgrade, the IBM Fusion rack user interface, IBM Storage Scale user interface, Grafana endpoint, and applications are not reachable for some time.
- During the upgrade, an intermittent error -1 node updated successfully out of 5 nodes shows on the IBM Fusion user interface in the upgrade details page.
- During the upgrade, an intermittent error shows on the IBM Fusion user interface that the progress percentage for the Global Data Platform decreased. It occurs especially when the upgrade for one node is completed.
- During the upgrade, an intermittent error Global Data Platform upgrade failed shows on the IBM Fusion user interface in the upgrade details page. Ignore the error because it might recover during the next reconciliation.