Automatic recovery

IBM Storage Scale recovers from certain issues without manual intervention.

The following automatic recovery options are available in the system:
  • Failover of CES IP addresses to recover from node failures. That is, if a critical service or protocol service fails on a node, the system changes the status of that node to FAILED and moves its public IP addresses to healthy nodes in the cluster.
    A failover is triggered under the following conditions:
    1. If the IBM Storage Scale monitoring service detects a critical problem in any of the CES components such as NFS, SMB, or OBJ, then the CES state is set to FAILED and a failover is triggered.
    2. If the IBM Storage Scale daemon detects a problem with the node or cluster, such as a node expulsion or quorum loss, then it runs callbacks and a failover is triggered.
    3. The CES framework also triggers a failover during the distribution of IP addresses as specified in the distribution policy.
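The IP redistribution that a failover performs can be sketched as a small simulation. This is a minimal sketch assuming a simple even-distribution policy; the node names, addresses, and the `redistribute` helper are hypothetical illustrations, not part of the actual CES framework.

```python
# Minimal sketch of CES IP failover under an assumed even-distribution
# policy; node and address names are hypothetical, not actual CES output.

def redistribute(nodes, failed):
    """Move public IPs from failed nodes onto the remaining healthy nodes."""
    healthy = [n for n in nodes if n not in failed]
    if not healthy:
        raise RuntimeError("no healthy CES nodes left to host addresses")
    orphaned = [ip for n in failed for ip in nodes[n]]
    for n in failed:
        nodes[n] = []  # a FAILED node no longer hosts public IPs
    # Assign each orphaned IP to the healthy node hosting the fewest IPs.
    for ip in orphaned:
        target = min(healthy, key=lambda n: len(nodes[n]))
        nodes[target].append(ip)
    return nodes

cluster = {
    "ces1": ["10.0.0.1", "10.0.0.2"],
    "ces2": ["10.0.0.3"],
    "ces3": ["10.0.0.4"],
}
redistribute(cluster, failed={"ces1"})
```

After the call, the two addresses from ces1 are spread across ces2 and ces3, keeping the address count balanced across the surviving nodes.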
  • If there are any errors with the SMB or Object protocol services, the system restarts the corresponding daemons. If restarting the protocol service daemons does not resolve the issue and the maximum retry count is reached, the system changes the status of the node to FAILED. The protocol service restarts are logged in the event log. Issue the mmhealth node eventlog command to view the details of such events.

    If the system detects multiple problems simultaneously, it starts the recovery procedure, such as an automatic restart, for the highest priority event first. After the recovery actions for that event are completed, the system health is monitored again, and then the recovery actions for the next priority event are started. Issues are handled in priority order until all failure events are resolved or the retry count is reached. For example, if the system has two failure events, smb_down and ctdb_down, the ctdb service is restarted first because the ctdb_down event has a higher priority. After the recovery actions for the ctdb_down event are completed, the system health is monitored again. If the ctdb_down issue is resolved, the recovery actions for the smb_down event are started.
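The priority-driven recovery loop above can be sketched as follows. This is an illustrative simulation only: the priority values, retry limit, and the restart and health-check callbacks are assumptions, not the monitoring service's real implementation.

```python
# Sketch of priority-driven recovery, assuming ctdb_down outranks smb_down.
# Priorities, retry limit, and callbacks are illustrative assumptions.

PRIORITY = {"ctdb_down": 1, "smb_down": 2}  # lower number = higher priority
MAX_RETRIES = 3

def recover(events, restart, is_resolved):
    """Handle failure events highest priority first, bounded by a retry count."""
    log = []
    for event in sorted(events, key=PRIORITY.get):
        for attempt in range(1, MAX_RETRIES + 1):
            restart(event)            # e.g. restart the ctdb daemon
            log.append((event, attempt))
            if is_resolved(event):    # monitor system health again
                break
    return log

# Hypothetical cluster where every restart succeeds on the first attempt:
order = recover({"smb_down", "ctdb_down"},
                restart=lambda e: None,
                is_resolved=lambda e: True)
```

With both events pending, ctdb_down is handled first and smb_down only after the renewed health check, matching the priority order described above.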

  • For CES HDFS, there is an extra active-to-passive switch on top of the basic CES failover. This switch moves the HDFS-dedicated IP addresses without affecting other protocols. For example, consider two protocol nodes, node 1 and node 2, that both run HDFS, NFS, and SMB. If the HDFS NameNode changes from active to standby (passive) on protocol node 1, then the HDFS NameNode on protocol node 2 changes from standby to active. However, the SMB and NFS services on protocol nodes 1 and 2 are not affected.
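The HDFS-only switch can be sketched as below: only the NameNode role moves between the two protocol nodes, while the NFS and SMB state is left untouched. The node names, role labels, and the `hdfs_failover` helper are hypothetical illustrations.

```python
# Sketch of the CES HDFS active/passive switch; it swaps only the HDFS
# NameNode role. Node names and role labels are illustrative assumptions.

def hdfs_failover(nodes, from_node, to_node):
    """Swap the HDFS NameNode roles without touching NFS or SMB state."""
    nodes[from_node]["hdfs"] = "standby"
    nodes[to_node]["hdfs"] = "active"
    return nodes

cluster = {
    "node1": {"hdfs": "active",  "nfs": "up", "smb": "up"},
    "node2": {"hdfs": "standby", "nfs": "up", "smb": "up"},
}
hdfs_failover(cluster, "node1", "node2")
```

After the switch, node 2 holds the active NameNode while NFS and SMB remain up on both nodes, mirroring the example in the text.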