Node down recovery in a multiple-node cluster

If the active control node (where the MMM process is running) becomes unavailable, an automatic failover restarts the MMM on the redundant control node so that the IBM Storage Archive Enterprise Edition service remains available.

However, even with automated failover processing, the following conditions require manual error recovery procedures:
  • When a node becomes unavailable, ongoing tasks on that node fail, and the user must retry them. For example, if a node fails during migration processing, the migration fails. The failed migration can be identified in the log file, and the user must rerun it, as shown in the first sketch after this list.
  • When a node becomes unavailable, the tape drives that are assigned to that node cannot be used. If a tape is mounted in an affected drive, the tape is also unavailable, and recall operations that require a file on that tape fail. To make the tape drive available again, the user can either restore the node or temporarily reassign the drive to another node, as in the second sketch after this list. For the full procedure, see Changing tape drive assignment in a multiple node cluster.
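For example, a failed migration can be identified and rerun from the command line. The following is a minimal sketch; the task ID (1234), file list (/tmp/mig_list.txt), and pool name (pool1) are hypothetical values, and the actual task IDs and failure details come from the eeadm task commands.

  # List completed tasks, including failed ones, to find the failed migration.
  eeadm task list -c

  # Show the details of the failed task (hypothetical task ID 1234)
  # to identify the affected files.
  eeadm task show 1234 -v

  # Rerun the migration for the failed files, assuming the file list
  # /tmp/mig_list.txt and the tape pool pool1.
  eeadm migrate /tmp/mig_list.txt -p pool1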
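Similarly, a tape drive that is stranded on the unavailable node can be temporarily reassigned to a surviving node. This sketch uses hypothetical values for the drive serial number (0000078D8960) and the node ID (2); for the full procedure, see Changing tape drive assignment in a multiple node cluster.

  # Check the current drive assignments and the state of each node.
  eeadm drive list
  eeadm node list

  # Unassign the drive from the failed node, then assign it to a
  # surviving node so that any mounted tape becomes usable again.
  eeadm drive unassign 0000078D8960
  eeadm drive assign 0000078D8960 -n 2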

If there is no eligible redundant control node, failover processing does not occur. Instead, the monitoring daemon (MD) attempts to restart the MMM on the failing active control node. There are threshold limits on how long, and how many times, the MD tries to restart the MMM. If the MMM does not restart, the user must correct the problem on the failing active control node and then return that node to operation.
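To check whether the MD was able to restart the MMM, or whether the control node still needs attention, the node state can be verified from any node in the cluster (the output columns vary by release):

  # List all nodes and their states; the control node should report
  # an available state once the MMM is running again.
  eeadm node list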

The user can also use the eeadm cluster restart command, or the eeadm node down and eeadm node up commands, to restore the control node to operation.
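For example, after the underlying problem on the node is corrected, the node can be taken out of and returned to operation, or the whole service can be restarted (node ID 1 is a hypothetical value):

  # Temporarily disable the repaired node, then return it to operation.
  eeadm node down 1
  eeadm node up 1

  # Alternatively, restart the IBM Storage Archive EE service across the cluster.
  eeadm cluster restart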