Failback procedures
Which failback procedure you follow depends upon whether the nodes and disks at the affected site have been repaired or replaced.
If the disks have been repaired, you must also consider
the state of the data on the failed disks:
- If the nodes and disks have been repaired and you are certain the data on the failed disks has not been changed, follow either:
  - failback with temporary loss and no configuration changes
  - failback with temporary loss and configuration changes
- If the nodes have been replaced and the disks have been either replaced or repaired, and you are not certain the data on the failed disks has not been changed, follow the procedure for failback with permanent loss.
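The choice described above can be sketched as a small decision function. This is only an illustration in Python: the function and parameter names are hypothetical, and it assumes that any case the rules do not cover explicitly (for example, replaced nodes with disks known to be unchanged) is treated conservatively as permanent loss.

```python
def choose_failback_procedure(nodes_replaced: bool,
                              disks_replaced: bool,
                              data_known_unchanged: bool,
                              config_changed: bool = False) -> str:
    """Return the name of the failback procedure suggested by the rules above.

    All names here are hypothetical; this is not part of any GPFS tooling.
    """
    # Assumption: replaced hardware, or any doubt about the data on the
    # failed disks, is handled as permanent loss.
    if nodes_replaced or disks_replaced or not data_known_unchanged:
        return "failback with permanent loss"
    # Repaired nodes and disks whose data is known to be unchanged.
    if config_changed:
        return "failback with temporary loss and configuration changes"
    return "failback with temporary loss and no configuration changes"


if __name__ == "__main__":
    # Repaired site, data untouched, no configuration changes made meanwhile.
    print(choose_failback_procedure(False, False, True))
```

Treating every uncertain case as permanent loss costs a full resynchronization but avoids trusting a replica that may be stale.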
Delayed failures: In certain failure
cases the loss of data may not be immediately apparent. For example,
consider this sequence of events:
- Site B loses connectivity with sites A and C.
- Site B then goes down due to loss of node quorum.
- Sites A and C remain operational long enough to modify some of the data on disk but suffer a disastrous failure shortly afterwards.
- Node and file system descriptor quorums are overridden to enable access at site B.
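After this sequence of events, the replicas of the file system are inconsistent, and the only way to reconcile them during recovery is to:
- Remove the damaged disks at sites A and C.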
- Either replace the disk and format a new NSD or simply reformat the existing disk if possible.
- Add the disk back to the file system, then use the mmrestripefs command to perform a full resynchronization of the file system's data and metadata and restore the replica balance.
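The last three steps above might map onto GPFS commands roughly as follows. This is a dry-run sketch in Python that only builds and prints the command lines; it does not execute anything. The device name fs0, the NSD names, and the stanza file path are hypothetical, and the exact options (in particular -b for mmrestripefs and -v no for mmcrnsd) are assumptions to check against the command reference for your release. It does not cover every step of the full permanent-loss procedure.

```python
import shlex


def recovery_plan(device, damaged_nsds, stanza_file, reuse_existing_disks=False):
    """Build (but do not run) the commands for the three recovery steps."""
    create_nsd = ["mmcrnsd", "-F", stanza_file]
    if reuse_existing_disks:
        # Assumption: reformatting a disk that was already an NSD requires
        # turning off the "already formatted" check.
        create_nsd += ["-v", "no"]
    return [
        # Step 1: remove the damaged disks from the file system.
        ["mmdeldisk", device, ";".join(damaged_nsds)],
        # Step 2: format new NSDs (or reformat the existing disks).
        create_nsd,
        # Step 3a: add the disks back to the file system.
        ["mmadddisk", device, "-F", stanza_file],
        # Step 3b: resynchronize and rebalance (option choice is an assumption).
        ["mmrestripefs", device, "-b"],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in recovery_plan("fs0", ["gpfs1nsd", "gpfs2nsd"], "/tmp/sitesAC.stanza"):
        print(shlex.join(cmd))
```

Printing the plan first makes it possible to review the commands before running them by hand on a cluster node.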