Upgrade recovery
Use this information to recover from a failed upgrade.
A failed upgrade might leave a cluster with multiple code levels installed. It is important to analyze the console output to determine which nodes or components were upgraded before the failure and which node or component was being upgraded when the failure occurred.
After the problem is isolated, restore the cluster to a healthy state before continuing the upgrade. Use the mmhealth command in addition to the mmces state show -a command to verify that all services are up. It might be necessary to manually start services that were down when the upgrade failed, so that all components are healthy before the upgrade continues.
For more information about verifying service status, see mmhealth command and mmces state show command.
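For example, a minimal verification sequence might look like the following sketch. NFS is used as an illustrative service; exact options can vary by release, so verify them against the documentation for your installed version.

    # Check the health of the local node and of the cluster as a whole
    mmhealth node show
    mmhealth cluster show

    # Check the CES service state on all protocol nodes
    mmces state show -a

    # If a service is down, start it manually (NFS shown as an example)
    mmces service start nfs -a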
Upgrade failure recovery when using the installation toolkit
If a failure occurs during an upgrade that is being done with the installation toolkit, examine the log file at the location provided in the installation toolkit output to determine the cause. The toolkit output also provides more information about each error.
Certain failures cause the upgrade process to stop. In that case, address the cause of the failure on the node on which it occurred, and then run the upgrade command again. Rerunning the upgrade does not affect nodes that were already upgraded successfully, and in most cases the upgrade continues from the point of failure.
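For example, with recent versions of the installation toolkit, a rerun after the failed node is fixed might look like the following sketch (run from the directory where the toolkit is extracted; the precheck step is optional but recommended):

    # Validate the environment again before rerunning the upgrade
    ./spectrumscale upgrade precheck

    # Rerun the upgrade; nodes that were already upgraded successfully
    # are not affected, and in most cases the upgrade resumes from the
    # point of failure
    ./spectrumscale upgrade run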
Examples of failure during upgrade are:
- A protocol is enabled in the configuration file, but is not running on a node
- Running the mmces service list command on the node that failed highlights which process is not running. The output from the installation toolkit also reports which component failed. Ensure that the component is started on all nodes with mmces service start nfs | smb | obj, or alternatively disable the protocol in the configuration by using ./spectrumscale disable nfs|smb|object if the component was intentionally stopped. A command sketch for this scenario follows this list.
  CAUTION: When a protocol is disabled, the protocol is stopped on all protocol nodes in the cluster and all protocol-specific configuration data is removed.
- CES cannot be resumed on a node due to CTDB version mismatch
- During an upgrade run, the CES resume operation might fail on a node because the CTDB service fails to start. The CTDB service fails to start when the SMB version on one or more nodes differs from the SMB version on the nodes that form the active CTDB cluster. In this case, do the following steps (a command sketch follows this list):
  - From the upgrade logs, determine the nodes on which the SMB version is different and designate those nodes as offline in the upgrade configuration.
  - Rerun the upgrade to complete it.
  - Remove the offline designation from the nodes and manually resume CES on those nodes.
- Upgrade recovery for HDFS not supported
- CES also supports the HDFS protocol. However, upgrade recovery is not supported for HDFS.
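The following sketch shows the recovery flow for the first example, a protocol that is enabled in the configuration but not running. NFS is used for illustration; substitute smb or obj as appropriate.

    # On the node that failed, list which CES services are running
    mmces service list

    # Start the missing component on all protocol nodes
    mmces service start nfs -a

    # Alternatively, if the component was intentionally stopped, disable
    # the protocol in the toolkit configuration. CAUTION: this stops the
    # protocol on all protocol nodes and removes protocol-specific
    # configuration data.
    ./spectrumscale disable nfs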
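The following sketch outlines the CTDB version mismatch recovery from the second example. The upgrade config offline and clear subcommands shown here are assumptions based on the offline designation described above; verify the exact syntax with ./spectrumscale upgrade config -h for your toolkit version. node1 is a placeholder node name.

    # Designate the node with the mismatched SMB version as offline in
    # the upgrade configuration (assumed subcommand; verify with -h)
    ./spectrumscale upgrade config offline -N node1

    # Rerun the upgrade to completion
    ./spectrumscale upgrade run

    # Remove the offline designation (assumed subcommand; verify with -h)
    ./spectrumscale upgrade config clear

    # Manually resume CES on the node
    mmces node resume -N node1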