Manual quorum override in a stretched system

A manual quorum override is required when you experience a rolling disaster.

In rare situations, the system can experience a rolling disaster. A rolling disaster is an incident of wide scope whose effects are felt in multiple steps over an extended period. The following example scenario describes a rolling disaster and shows how to recover from it.

An example of a rolling disaster occurs when the following events happen in sequence:
  1. The link between the two sites fails, at which point one site uses the automatic quorum feature to continue operation.
  2. The site that has control of the quorum device then fails (because of a power outage, for example).

This example leaves the second site as the only site that is potentially capable of continuing data I/O. However, it is unable to do so until it gains control of the quorum device. The MDisks in the second site stop. Nodes at the site display the node error 551, indicating that an insufficient number of nodes are available to form a quorum in a stretched system configuration.

In this scenario, you can run the quorum override command to override the automatic quorum device selection and create a new system that contains the nodes in the second site.
Note:
  • To ensure that the system is in the correct state before it is used, the quorum override command can be run only with assistance from support.
  • If a fabric disruption occurs while the quorum override command is running, a subset of the nodes might update their system ID. The updated nodes display node error 550, the nodes that were not updated display node error 551, and the two groups of nodes are assigned to two different systems. In this situation, you can run the quorum override command again on one of the nodes that reported error 551. This command updates all the nodes in the two systems with a new cluster (system) ID. You can then recover data.
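
The rolling-disaster sequence and the effect of the override can be traced with a small state model. The Python sketch below is illustrative only: the class, its method names, and the site names are hypothetical and are not part of the system CLI. It assumes a simplified rule in which, once the link between the sites has failed, only the site that controls the quorum device can continue data I/O.

```python
class StretchedSystemModel:
    """Toy model of a two-site stretched system (hypothetical, not a product API)."""

    def __init__(self):
        self.powered = {"site1": True, "site2": True}
        self.link_up = True
        self.quorum_owner = None  # site that currently controls the quorum device

    def fail_link(self, winner):
        # Step 1: the inter-site link fails and one site wins the quorum.
        self.link_up = False
        self.quorum_owner = winner

    def fail_site(self, site):
        # Step 2: a whole site goes down (for example, a power outage).
        self.powered[site] = False

    def override_quorum(self, site):
        # Models the support-assisted quorum override: the surviving site
        # forms a new system and takes control of the quorum.
        if not self.powered[site]:
            raise ValueError(f"{site} is not powered")
        self.quorum_owner = site

    def serving_sites(self):
        # Simplifying assumption: with the link up, every powered site serves
        # I/O; with the link down, only the quorum owner does.
        if self.link_up:
            return sorted(s for s, up in self.powered.items() if up)
        return sorted(
            s for s, up in self.powered.items() if up and s == self.quorum_owner
        )

    def blocked_sites(self):
        # Powered sites that cannot serve I/O because they lack the quorum.
        serving = set(self.serving_sites())
        return sorted(s for s, up in self.powered.items() if up and s not in serving)


def rolling_disaster_trace():
    """Replay the example scenario and record which sites serve I/O after each step."""
    model = StretchedSystemModel()
    trace = []
    model.fail_link(winner="site1")
    trace.append(model.serving_sites())
    model.fail_site("site1")
    trace.append(model.serving_sites())
    model.override_quorum("site2")
    trace.append(model.serving_sites())
    return trace
```

In this model, the second step leaves no site serving I/O even though the second site is still powered, which is the state that the quorum override is meant to resolve.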

Enforcing conditions for a quorum

To make the quorum override command available if a rolling disaster occurs, you must run the chsystem -topology stretched command as part of the installation process for the system. The quorum override command is not available in systems that do not have the topology set to stretched. Before you can use the command, the following prerequisites must be met:

  • Every I/O group with two nodes must have one node assigned to site 1 and the other node assigned to site 2.
  • Every storage system with MDisks must have its site defined.

When these prerequisites are met and automatic quorum selection is enabled, the system attempts to assign one quorum device in each of the three sites. If a site does not have an MDisk that is suitable to be a quorum device, no quorum device is assigned to that site.
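
The two prerequisites and the per-site quorum assignment can be expressed as simple checks. The following Python sketch is a model for reasoning about a configuration, not a tool that queries the system: the function names, the data layout, and the object names in the usage note are hypothetical. Picking the first suitable MDisk at each site is also an assumption, because the text does not say how the system chooses among candidates.

```python
def check_prerequisites(io_groups, controller_sites):
    """Return a list of problems; an empty list means both prerequisites hold.

    io_groups: dict mapping I/O group name -> list of node site numbers
               (None for a node with no site assigned).
    controller_sites: dict mapping storage system name -> site number or None.
    """
    problems = []
    for name, sites in sorted(io_groups.items()):
        # The rule applies only to I/O groups that contain two nodes.
        if len(sites) == 2:
            assigned = sorted(s for s in sites if s is not None)
            if assigned != [1, 2]:
                problems.append(f"{name}: needs one node in site 1 and one in site 2")
    for name, site in sorted(controller_sites.items()):
        if site is None:
            problems.append(f"{name}: site is not defined")
    return problems


def assign_quorum_devices(suitable_mdisks_by_site):
    """Pick one quorum device per site (1, 2, 3), or None where no MDisk is suitable.

    Assumption: the first suitable MDisk listed for a site is chosen.
    """
    assignment = {}
    for site in (1, 2, 3):
        candidates = suitable_mdisks_by_site.get(site) or []
        assignment[site] = candidates[0] if candidates else None
    return assignment
```

For example, `check_prerequisites({"io_grp0": [1, 1]}, {"controller0": None})` reports one problem for the I/O group and one for the storage system, and `assign_quorum_devices({1: ["mdisk0"], 2: ["mdisk4"]})` leaves site 3 without a quorum device.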

Note: After the chsystem -topology stretched command is run, you cannot alter the site assignment of any controller except where that controller is a new controller that has only unmanaged MDisks.

The system also does not allow the site settings for nodes to be changed. This enforcement is required so that the quorum override command can operate correctly.

When you run the chsystem -topology standard command, it is again possible to alter the site settings for nodes and controllers. However, this command disables the quorum override feature. Therefore, run chsystem -topology stretched when you complete your changes to re-enable this support.
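
The interaction between the topology setting, site changes, and the availability of the quorum override can be summarized as a pair of rules. This Python sketch restates the behavior described above under stated assumptions: the function names and argument names are hypothetical, and only the two topology values named in this topic (standard and stretched) are modeled.

```python
def can_change_site(topology, kind, new_with_only_unmanaged_mdisks=False):
    """Return True if a site assignment may be altered in this model.

    topology: "standard" or "stretched".
    kind: "node" or "controller".
    new_with_only_unmanaged_mdisks: True for a new controller that has
        only unmanaged MDisks (the one exception under stretched topology).
    """
    if kind not in ("node", "controller"):
        raise ValueError(f"unknown object kind: {kind}")
    if topology == "standard":
        return True
    if topology == "stretched":
        return kind == "controller" and new_with_only_unmanaged_mdisks
    raise ValueError(f"unknown topology: {topology}")


def quorum_override_available(topology):
    """The quorum override is available only when the topology is stretched."""
    return topology == "stretched"
```

Read together, the two functions show the trade-off: switching to the standard topology unlocks site changes but removes the override, so the topology is set back to stretched after the changes are complete.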