Automatic recovery
IBM Storage Scale recovers from certain issues automatically, without manual intervention.
- Failover of CES IP addresses to recover from node failures. If an important service or protocol service fails on a node, the system changes the status of that node to Failed and moves the public IP addresses to healthy nodes in the cluster. A failover is triggered under the following conditions:
- If the IBM Storage Scale monitoring service detects a critical problem in any of the CES components such as NFS, SMB, or OBJ, then the CES state is set to FAILED and it triggers a failover.
- If the IBM Storage Scale daemon detects a problem with the node or cluster, such as a node expel or quorum loss, it runs callbacks and a failover is triggered.
- The CES framework also triggers a failover during the distribution of IP addresses as specified in the distribution policy.
- If there are any errors with the SMB and Object protocol services, the system restarts the corresponding daemons. If restarting the protocol service daemons does not resolve the issue and the maximum retry count is reached, the system changes the status of the node to Failed. The protocol service restarts are logged in the event log. Issue the mmhealth node eventlog command to view the details of such events.
If the system detects multiple problems simultaneously, it starts the recovery procedure, such as an automatic restart, for the highest priority event first. After the recovery actions for that event are completed, the system health is monitored again and the recovery actions for the event with the next highest priority are started. All events are handled in this priority order until all failure events are resolved or the retry count is reached. For example, if the system has two failure events, smb_down and ctdb_down, the ctdb service is restarted first because the ctdb_down event has the higher priority. After the recovery actions for the ctdb_down event are completed, the system health is monitored again. If the ctdb_down issue is resolved, the recovery actions for the smb_down event are started.
- For CES HDFS, there is an extra active-to-passive switch on top of the basic CES failover. This switch moves the HDFS-dedicated IP addresses without affecting other protocols. For example, consider two protocol nodes, node 1 and node 2, that have active HDFS, NFS, and SMB. If the HDFS NameNode changes from active to standby (passive) on protocol node 1, the HDFS NameNode changes from standby to active on protocol node 2. However, SMB and NFS on protocol node 1 and node 2 are not affected.
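
The priority-ordered handling of simultaneous failure events described in this topic can be illustrated with a small model. This is only a sketch and not IBM Storage Scale code. The numeric priorities, the MAX_RETRIES value, and the restart callback are hypothetical and exist only for illustration. The only details taken from the text are that ctdb_down is handled before smb_down, that health is re-evaluated after each recovery action, and that the node is treated as Failed when the retry count is reached.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical priorities: a lower number means a higher priority.
# Only the relative order (ctdb_down before smb_down) comes from the text.
PRIORITY: Dict[str, int] = {"ctdb_down": 0, "smb_down": 1}
MAX_RETRIES = 3  # assumed value; the real retry limit is not stated here


def recover(
    active: Set[str],
    restart: Callable[[str, Set[str]], None],
    max_retries: int = MAX_RETRIES,
) -> Tuple[List[str], bool]:
    """Handle failure events one at a time, highest priority first.

    `active` is the set of currently detected failure events. `restart` is
    a hypothetical recovery action that may remove events from `active`.
    Returns the order of attempted recovery actions and whether the node
    ended up healthy (True) or would be marked Failed (False).
    """
    attempts: Dict[str, int] = {}
    order: List[str] = []
    while active:
        # Re-evaluate the remaining events after every recovery action.
        event = min(active, key=lambda e: (PRIORITY.get(e, len(PRIORITY)), e))
        if attempts.get(event, 0) >= max_retries:
            return order, False  # retry count reached: node counts as Failed
        attempts[event] = attempts.get(event, 0) + 1
        order.append(event)
        restart(event, active)
    return order, True


def demo_restart(event: str, active: Set[str]) -> None:
    # Pretend that every restart succeeds and clears its own event.
    active.discard(event)


order, healthy = recover({"smb_down", "ctdb_down"}, demo_restart)
print(order, healthy)  # ['ctdb_down', 'smb_down'] True
```

In this model, a restart action that never clears ctdb_down causes recover to stop after MAX_RETRIES attempts and report the node as not healthy, without ever attempting the lower priority smb_down recovery.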