Issues related to IBM Fusion HCI System node drains
Use these common troubleshooting tips and tricks when you work with IBM Fusion HCI System.
Issues related to draining an IBM Fusion HCI System node
A node drain can hang when it is initiated by any of the following operations:
- Configuration updates
  - Red Hat® OpenShift® Machine Config Operator (MCO)
  - IBM Fusion component specification changes
- Upgrade
  - OpenShift Container Platform upgrades
  - IBM Fusion software upgrades
  - Firmware upgrades
- User initiated
  - Maintenance operation
  - kdump enabling or disabling
  - Operations that result in a node restart
  - Node drain
- Cause
- The PDB configuration is set by IBM Storage Scale to control the node drain so that the Scale cluster always remains healthy; you can drain only an allowed number of nodes at any time. For more details, see Scale behavior during node restarts.
- Resolution
- When a node drain hangs due to any of the listed operations, complete the following steps:
- Run the following command to identify the pod that causes the issue. The command lists the pods that are pending eviction:
  oc describe cmt <target node name> -n ibm-spectrum-fusion-ns
- Go through the events of the node that has the issue:
- Scenario where drains are prevented by an application
- If drains are prevented by an application, consult the owner of the application to proceed with the drain, and manually drain the node (an example drain command follows this list). For more information about the issue and how to troubleshoot it, see Identifying applications preventing cluster maintenance.
- Scenario where drains are prevented by a VM
- Check Identifying applications preventing cluster maintenance to determine the node that is waiting for a reboot, and check whether live migration is set up properly so that the VM can migrate to a different node (an example check follows this list). For more details about setting up live migration, see Virtual machine live migration.
- Scenarios where the issue is due to node maintenance
- The maintenance mode on a node can take from four minutes to thirty minutes to succeed. If the maintenance mode operation takes more than 30 minutes, Fusion continues to retry draining the pods with a warning event BMYCO0012, and eventually times out. If you want to stop the retries, delete the Computemaintenance CR instance. Run the following command to delete the Computemaintenance CR instance:
  oc delete cmt <instance-nodename> -n ibm-spectrum-fusion-ns
- If it takes a long time and you want to continue the operation, check which pod is pending by using the oc get pods command. If the pod name belongs to IBM Storage Scale, see Identifying applications preventing cluster maintenance. If the issue persists, collect the Scale system health and Scale logs, and contact IBM Support.
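For the scenario where an application prevents the drain, you can manually drain the node after the application owner confirms that its workload can be moved. The following commands are a minimal sketch that uses the standard OpenShift drain utility; the node name is a placeholder for your target node:
oc adm drain <target node name> --ignore-daemonsets --delete-emptydir-data
oc adm uncordon <target node name>
Run the uncordon command only after the maintenance on the node is complete, so that the node becomes schedulable again.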
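For the scenario where a VM prevents the drain, you can check whether the virtual machine instance reports itself as live-migratable before you retry the drain. This is a sketch that assumes OpenShift Virtualization is installed; the VM name and namespace are placeholders:
oc get vmi <vm-name> -n <vm-namespace> -o yaml
Look for the LiveMigratable condition in the status section of the output. If the virtctl client is installed, you can also trigger a live migration manually:
virtctl migrate <vm-name> -n <vm-namespace>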
Scale behavior during node restarts
A node can restart for various reasons, such as an MCO rollout, a user-initiated operation, or a firmware or software upgrade. Scale has a maximum tolerance for the number of nodes that are in an unavailable or not-ready state, which is based on the erasure code that is configured for the storage cluster. To stay within this tolerance, Scale uses the Pod Disruption Budget (PDB) feature of OpenShift.
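Before you restart or drain another node, you can check how many nodes are already unavailable. For example:
oc get nodes
Nodes that show NotReady in the STATUS column count toward the number of unavailable nodes that Scale tolerates for the configured erasure code.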
Scale CNSA implements a PDB with "maxUnavailable=0", which means that it does not allow a Scale pod to go down without its knowledge. Even with a "maxUnavailable=0" PDB, the design does allow cluster updates. In this configuration, MCO is prevented from taking down the node and draining the CNSA core pod. Instead, the Scale CNSA pod detects the operation, drains the applications, and then exits itself, which frees up the node for the operation to continue.
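A minimal way to confirm this setting on your cluster is to view the PDB that CNSA creates. The namespace below is an assumption; adjust it to wherever the Scale core pods run on your cluster:
oc get pdb -n ibm-spectrum-scale -o yaml
In the spec of the returned PDB, maxUnavailable is 0, which is why MCO cannot evict the CNSA core pod directly.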
If any application refuses to unmount, or if the Scale core pod itself determines that the cluster would go into a bad or hung state if the pod went down, the operation pauses until the condition is resolved.
For more information about identifying applications that prevent cluster maintenance, see Identifying applications preventing cluster maintenance.