Global Data Platform service install and upgrade issues
Use this troubleshooting information to resolve install and upgrade problems that are related to the Global Data Platform service.
Warning: Do NOT delete IBM Storage Scale pods. In many circumstances, deleting Scale pods affects availability and data integrity.
IBM Storage Scale installation stuck with an error
- Problem statement
- The IBM Storage Scale installation is stuck with the following error:
ERROR Failed to define vdiskset {"controller": "filesystem", "controllerGroup": "scale.spectrum.ibm.com", "controllerKind": "Filesystem", "Filesystem": {"name":"ibmspectrum-fs","namespace":"ibm-spectrum-scale"}, "namespace": "ibm-spectrum-scale", "name": "ibmspectrum-fs", "reconcileID": "5fcf21a4-b488-44db-8db5-0f3c1ab695cd", "cmd": "/usr/lpp/mmfs/bin/mmvdisk vs define --vs ibmspectrum-fs-0 --rg rg1 --code 4+2P --bs 4M -
- Resolution
- As a resolution, run the following command to manually create a recovery group.
mmvdisk recoverygroup create --recovery-group <RG name> --node-class <Node class name>
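For example, with the recovery group name rg1 that appears in the error message and a hypothetical node class name ece_nodeclass_1:
# Hypothetical values; substitute your own recovery group and node class names
mmvdisk recoverygroup create --recovery-group rg1 --node-class ece_nodeclass_1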
Upgrade does not complete as node pod is in ContainerStatusUnknown state
- Problem statement
- Global Data Platform upgrade does not complete because a compute node pod is in the ContainerStatusUnknown state.
- Workaround
- Delete the pod in the ContainerStatusUnknown state and restart the compute node.
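A minimal sketch, assuming the stuck pod is in the ibm-spectrum-scale namespace and that rebooting the node from a debug pod is an acceptable way to restart it:
# Find the pod that is stuck in ContainerStatusUnknown
oc get pods -n ibm-spectrum-scale | grep ContainerStatusUnknown
# Delete the stuck pod
oc delete pod <pod_name> -n ibm-spectrum-scale
# Restart the compute node, for example by rebooting it from a debug pod
oc debug node/<compute_node_name> -T -- chroot /host systemctl reboot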
Pod drain issues in Global Data Platform upgrade
- Diagnosis and resolution
- If the upgrade of Global Data Platform gets stuck, do the following to diagnose and resolve the issue:
- Go through the logs from the scale operator.
- Check whether you observe the following error:
ERROR Drain error when evicting pods
- If it is a drain error, check whether the virtual machine's PVCs were created with the ReadWriteOnce access mode (a command to list the access modes is shown after this list). PVCs created with ReadWriteOnce are not shareable or movable and can block the draining of the node.
For example:
oc get pods
NAME                               READY   STATUS                   RESTARTS   AGE
compute-0                          2/2     Running                  0          20h
compute-1                          2/2     Running                  0          24h
compute-13                         2/2     Running                  0          21h
compute-14                         2/2     Running                  3          21h
compute-2                          2/2     Running                  0          21h
compute-3                          2/2     Running                  0          21h
control-0                          2/2     Running                  0          23h
control-1                          2/2     Running                  0          23h
control-1-1                        2/2     Running                  0          23h
control1-1-reboot                  0/1     ContainerStatusUnknown   1          23h
control-2                          2/2     Running                  0          23h
ibm-spectrum-scale-gui-0           4/4     Running                  0          20h
ibm-spectrum-scale-gui-1           4/4     Running                  0          23h
ibm-spectrum-scale-pmcollector-0   2/2     Running                  0          23h
ibm-spectrum-scale-pmcollector-1   2/2     Running                  0          20h
- If the virtual machine's PVCs were created with ReadWriteOnce, stop the virtual machine to continue the upgrade of the Scale pods.
- If the upgrade fails further, check whether the cause is an orphaned control-1-reboot pod left in the ibm-spectrum-scale namespace. Delete the pod to resume and complete the upgrade (see the example delete command after this list).
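A minimal sketch of both checks, assuming the virtual machine's PVCs are in the namespace that hosts the virtual machine and the orphaned reboot pod is in the ibm-spectrum-scale namespace:
# List the virtual machine's PVCs and their access modes
# (replace <vm_namespace> with the namespace of the virtual machine)
oc get pvc -n <vm_namespace> -o custom-columns=NAME:.metadata.name,ACCESSMODES:.spec.accessModes
# Delete the orphaned reboot pod, for example control1-1-reboot from the sample output above
oc delete pod <reboot_pod_name> -n ibm-spectrum-scale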
IBM Storage Scale might get stuck during upgrade
- Resolution
- Do the following steps:
- Shut down all applications that use storage.
- Scale down the IBM Storage Scale operator. Set the replicas to 0 for the ibm-spectrum-scale-controller-manager deployment.
- Use the ibm-spectrum-scale-operator project:
oc project ibm-spectrum-scale-operator
- Scale down the IBM Storage Scale operator:
oc scale --replicas=0 deployment ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator
- Run the following command to confirm that no resources are found in the ibm-spectrum-scale-operator namespace:
oc get po
- Log in to one of the IBM Storage Scale core pods and shut down the server.
- Log in to the server:
oc rsh compute-0
- Shut down the server:
mmshutdown -a
Example output:
Thu May 19 07:46:47 UTC 2022: mmshutdown: Starting force unmount of GPFS file systems
Thu May 19 07:46:52 UTC 2022: mmshutdown: Shutting down GPFS daemons
compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
compute-0.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
control-1.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
control-0.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
compute-1.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
compute-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
control-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Shutting down!
compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: 'shutdown' command about to kill process 1364
compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading modules from /lib/modules/4.18.0-305.19.1.el8_4.x86_64/extra
compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: 'shutdown' command about to kill process 1276
compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading modules from /lib/modules/4.18.0-305.19.1.el8_4.x86_64/extra
compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: mmfsenv: Module mmfslinux is still in use.
compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: mmfsenv: Module mmfslinux is still in use.
compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfs26
compute-13.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfs26
compute-14.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unmount all GPFS file syste
..
compute-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfs26
control-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfs26
control-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfslinux
compute-0.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfslinux
compute-2.ibm-spectrum-scale-core.ibm-spectrum-scale.svc.cluster.local: Unloading module mmfslinux
Thu May 19 07:47:03 UTC 2022: mmshutdown: Finished
- Check whether the states are down:
mmgetstate -a
Sample output:
 Node number  Node name          GPFS state
---------------------------------------------
           1  control-0-daemon   down
           2  control-1-daemon   down
           3  control-2-daemon   down
           4  compute-14-daemon  down
           5  compute-13-daemon  down
           6  compute-0-daemon   down
           7  compute-1-daemon   down
           8  compute-2-daemon   down
- Delete all the pods in the ibm-spectrum-scale namespace:
oc delete pods --all -n ibm-spectrum-scale
- To verify, get the list of pods:
oc get po -n ibm-spectrum-scale
Sample output:
NAME                               READY   STATUS    RESTARTS   AGE
ibm-spectrum-scale-gui-0           0/4     Pending   0          7s
ibm-spectrum-scale-pmcollector-0   0/2     Pending   0          7s
Note: Only the IBM Storage Scale pods get deleted, and not the GUI and pmcollector pods.
IBM Storage Scale upgrade gets stuck
- Cause
- The IBM Storage Scale upgrade gets stuck because of isf-storage-service. It happens whenever the scaleoutStatus or scaleupStatus is In Progress in the Scale CR.
Important: Do not update the isf-storage-service image to the latest version when an IBM Storage Scale operation is in progress.
- Workaround
- Do the following steps to upgrade IBM Storage Scale on the rack:
- Go to the isf-storage-service in the ibm-spectrum-fusion-ns project and change the image to cp.icr.io/cp/isf/isf-storage-services:2.2.0-latest, or run the following oc command:
oc patch deploy isf-storage-service-dep -n ibm-spectrum-fusion-ns --patch='{"spec":{"template":{"spec":{"containers":[{"name": "app", "image":"cp.icr.io/cp/isf/isf-storage-services:2.2.0-latest"}]}}}}'
- After the isf-storage-service pod goes to the Running state and points to the cp.icr.io/cp/isf/isf-storage-services:2.2.0-latest image, run the following API endpoint command in the isf-storage-service pod's terminal:
curl -k https://isf-scale-svc/api/v1/upgradeExcludingOperator
You can observe the new logs for isf-storage-service as well.
as well. - Set the Edition to
erasure-code
in theDaemon CR
:oc patch daemon ibm-spectrum-scale \ --type='json' -n ibm-spectrum-scale \ -p="[{'op': 'replace', 'path': '/spec/edition', 'value': "erasure-code"}]"
- Run the following curl command to deploy the operator on the Red Hat® OpenShift® Container Platform cluster:
curl -k https://isf-scale-svc/api/v1/upgradeWithOperator
You can observe new logs for isf-storage-service.
- Wait for some time and check whether the IBM Storage Scale core pods are in the restarting or Running state in the ibm-spectrum-scale namespace, as shown in the check after this list.
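A minimal way to check, assuming the core pods (compute-* and control-*, as in the earlier sample output) run in the ibm-spectrum-scale namespace:
# List the IBM Storage Scale core pods and confirm that they are restarting or Running
oc get pods -n ibm-spectrum-scale -o wide
# Optionally watch until the pods settle into the Running state
oc get pods -n ibm-spectrum-scale -w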
CSI Pods experience scheduling problems post node drain
- Problem statement
- Whenever a compute node is drained, the CSI pods or sidecar pods get evicted. The remaining available compute nodes cannot host the evicted CSI or sidecar pods because of resource constraints at that point in time.
- Resolution
- Ensure that a functional system exists with available compute nodes that have sufficient resources to accommodate the evicted CSI or sidecar pods. A quick check is shown after this section.
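A quick way to spot evicted CSI or sidecar pods that cannot be scheduled and to inspect node capacity; the ibm-spectrum-scale-csi namespace and the node name are assumptions for illustration:
# Find CSI or sidecar pods that are stuck in Pending after the node drain
oc get pods -n ibm-spectrum-scale-csi --field-selector=status.phase=Pending -o wide
# Check how much CPU and memory is already allocated on a candidate compute node
oc describe node <compute_node_name> | grep -A 8 "Allocated resources"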
Global Data Platform upgrade progress
- Problem statement
- During the Global Data Platform upgrade, the progress percentage can go beyond 100% in a few cases. You can ignore this issue because the percentage comes down to 100% after the upgrade completes successfully.
Scale installation might get stuck with ECE pods in init state
- Problem statement
- The Scale installation might get stuck with ECE pods in the init state with the following error:
The node already belongs to a GPFS cluster.
- Resolution
-
- Check whether the daemon-network IP of the pod is pingable.
- If it is not pingable, then restart all available TORs.
- After you restart the TORs, check whether the daemon-network IP is pingable.
- Manually clean up the nodes. For each worker node, run the following oc command:
oc debug node/<openshift_worker_node> -T -- chroot /host sh -c "rm -rf /var/mmfs"
- Kill the pod and wait for it to come up again, as shown in the sketch after this list. The Scale installation resumes after all the ECE pods go to the Running state.
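A minimal sketch of the checks, assuming the ECE pods run in the ibm-spectrum-scale namespace; the IP address and pod name are placeholders:
# Check whether the daemon-network IP of the stuck ECE pod responds to ping
ping -c 3 <daemon_network_ip>
# Delete the stuck ECE pod so that it is re-created
oc delete pod <ece_pod_name> -n ibm-spectrum-scale
# Watch until all ECE pods reach the Running state
oc get pods -n ibm-spectrum-scale -w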
Expansion rack fails for 4+2p setup
- Problem statement
- Sometimes, in a high-availability cluster, the expansion rack fails for a 4+2p setup.
- Resolution
-
- Do the following workaround steps to resolve the issue:
- In the OpenShift Container Platform user interface, go to the Scale CR and check whether scaleoutStatus: IN-PROGRESS is set.
- If scaleoutStatus is IN-PROGRESS, go to the RecoveryGroup instances and check the created RecoveryGroups.
- If more than two recovery groups are created, do the following actions:
- Go to the RecoveryGroup instances.
- Find the RecoveryGroup instance with the suffix rg2.
- Click the ellipsis menu for that instance and select the Delete RecoveryGroup option.
- Do the following checks to validate whether the node upsize completed successfully:
- In the IBM Fusion user interface, go to the IBM Storage Scale user interface by using the app switcher.
- Click Recovery groups on the IBM Spectrum Scale RAID tile.
- Click the entry of Recovery group server nodes.
- Check whether all the newly added nodes for the expansion rack are listed and are in the Healthy state.
- Create a sample PVC to check whether storage can be provisioned. See the sample PVC sketch after this list.
- Check for the labels on each recovery group.
- In the OpenShift Container Platform console, go to the RecoveryGroup resources.
- Check whether the YAML contains either of these two labels:
scale.spectrum.ibm.com/scale: up
scale.spectrum.ibm.com/scale: out
- If it is yes for the previous step, remove those labels from each recovery group and click Save. You can also remove the labels with the oc command shown after this list.
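A minimal sketch of the label cleanup and the sample PVC check, assuming the RecoveryGroup resources and the recovery group name rg1 are in the ibm-spectrum-scale namespace; the PVC name, namespace, and storage class are hypothetical examples:
# Remove the scale label from a recovery group (the trailing "-" deletes the label)
oc label recoverygroup rg1 scale.spectrum.ibm.com/scale- -n ibm-spectrum-scale
# Create a sample PVC to verify that storage can be provisioned
# (replace <filesystem_storage_class> with the storage class of your file system)
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sample-pvc-check
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: <filesystem_storage_class>
EOF
# Confirm that the PVC reaches the Bound state
oc get pvc sample-pvc-check -n default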
Known issues in installation
- The display unit of the file system block size in storagesetupcr is MiB and that of ibmspectrum-fs is M. As IBM Storage Scale also uses the MiB format internally, you can ignore this inconsistency.
Known issues in upgrade
- During the upgrade, the IBM Fusion rack user interface, IBM Storage Scale user interface, Grafana endpoint, and Applications are not reachable for some time.
- During the upgrade, an intermittent error -1 node updated successfully out of 5 nodes shows on the IBM Fusion user interface in the upgrade details page.
- During the upgrade, an intermittent error shows on the IBM Fusion user interface that the progress percentage for the Global Data Platform decreased. It occurs especially when the upgrade for one node is completed.
- During the upgrade, an intermittent error Global Data Platform upgrade failed shows on the IBM Fusion user interface in the upgrade details page. Ignore the error because it might recover during the next reconciliation.