High availability and disaster recovery (3-site replication)
Combining policy-based high availability with asynchronous policy-based replication for disaster recovery enables a 3-site replication solution for the most mission-critical workloads. You can manage high availability and disaster recovery replication from the Storage partition page in the management GUI.
Policy-based HA provides a zero recovery point objective (RPO) solution. It uses two independent storage systems with synchronous replication across metro-area distances, active-active host access at both sites, and seamless failover for a zero recovery time objective (RTO). Asynchronous replication (RPO > 0) to a third system provides disaster recovery capability (RTO > 0) in the event of a metro-area disaster.
Three-site replication for a storage partition can be configured in any of the following ways:
- Starting with an existing 2-site high availability configuration, then adding replication for disaster recovery at a third site.
- Starting with an existing 2-site configuration for disaster recovery using storage partitions, then adding high availability to either of the partitions.
- Configuring 3-site replication for high availability and disaster recovery in a single operation.
Partnerships must be configured between all three systems. High availability requires either a Fibre Channel or an RDMA-based Ethernet partnership. The partnership for disaster recovery additionally supports the use of a long-distance TCP partnership. The different partnerships between systems can use different types of connectivity.
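The partnership rules above can be expressed as a small planning check. The following is a minimal sketch, not a product API: the function name, the system names, and the connectivity labels (`fc`, `rdma_ethernet`, `tcp`) are hypothetical and chosen only for illustration. It verifies that every pair of systems has a partnership and that the HA pair does not rely on a TCP partnership.

```python
from itertools import combinations

# Connectivity types named in the text; the string labels are illustrative only.
HA_ALLOWED = {"fc", "rdma_ethernet"}
DR_ALLOWED = HA_ALLOWED | {"tcp"}  # DR additionally supports long-distance TCP


def check_partnerships(ha_systems, dr_system, partnerships):
    """Return a list of problems found in a planned 3-site topology.

    ha_systems:   tuple with the names of the two HA systems (hypothetical names).
    dr_system:    name of the DR system.
    partnerships: dict mapping frozenset({a, b}) to a connectivity label.
    """
    problems = []
    ha_pair = frozenset(ha_systems)
    for a, b in combinations([*ha_systems, dr_system], 2):
        pair = frozenset((a, b))
        kind = partnerships.get(pair)
        if kind is None:
            problems.append(f"missing partnership: {a} <-> {b}")
            continue
        role, allowed = ("HA", HA_ALLOWED) if pair == ha_pair else ("DR", DR_ALLOWED)
        if kind not in allowed:
            problems.append(f"{a} <-> {b}: {kind} not supported for {role}")
    return problems


if __name__ == "__main__":
    plan = {
        frozenset(("siteA", "siteB")): "fc",   # HA partnership
        frozenset(("siteA", "siteC")): "tcp",  # DR partnership
        frozenset(("siteB", "siteC")): "tcp",  # DR partnership
    }
    print(check_partnerships(("siteA", "siteB"), "siteC", plan))
```

Each partnership is checked independently, which reflects the statement that different partnerships can use different types of connectivity.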
The use of separate node ports is recommended to isolate the different types of traffic for:
- High availability partnerships
- Disaster recovery partnerships
- Hosts and external storage, which can also be separated further if desired
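As an illustration of the port isolation recommendation, the sketch below flags any node port that is assigned to more than one traffic type in a plan. The role names and port numbers are hypothetical, and the check is a planning aid under those assumptions, not a system command.

```python
def shared_ports(assignments):
    """Return {port: sorted list of roles} for ports used by more than one role.

    assignments: dict mapping a traffic role (hypothetical labels such as
    "ha", "dr", "host_and_storage") to the set of node port IDs planned for it.
    """
    usage = {}
    for role, ports in assignments.items():
        for port in ports:
            usage.setdefault(port, set()).add(role)
    return {port: sorted(roles) for port, roles in usage.items() if len(roles) > 1}


if __name__ == "__main__":
    plan = {
        "ha": {1, 2},                # high availability partnership traffic
        "dr": {3},                   # disaster recovery partnership traffic
        "host_and_storage": {3, 4},  # hosts and external storage
    }
    # Port 3 carries both DR and host traffic in this plan, so it is reported.
    print(shared_ports(plan))
```

An empty result means that each traffic type has its own ports. Hosts and external storage are grouped under one role here; they can be split into separate roles if further separation is desired.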
Example error scenarios
This table lists examples of possible error scenarios and the expected behaviour in a 3-site solution, starting from an initial healthy state.
Scenario | High availability | Disaster recovery |
---|---|---|
Loss of connectivity between the non-preferred HA system and the DR system | No impact | No impact |
Loss of connectivity between the preferred HA system and the DR system | The active management system will change to the non-preferred system to maintain replication to the DR system. The storage partition remains active-active with no interruption to host access. The preferred system will automatically become the active management system again when connectivity to the DR system is restored. | No impact |
Loss of connectivity between the HA systems | If the connectivity issue lasts more than 3 seconds, HA will be suspended. The volumes within the storage partition will remain accessible through one of the two HA systems. | No impact. Replication will continue between the active management system and the DR system. |
Both HA systems lose connectivity to the DR system | No impact | Replication is suspended. Replication will resume automatically when connectivity is restored. |
One or both of the HA systems lose connectivity to all quorum applications | No impact. However, the loss of quorum will prevent the non-preferred system from taking over the partition if the HA systems lose connectivity, or if the preferred system becomes unavailable. | No impact |
The preferred HA system is unavailable (for example, power failure, hardware failure) | The storage partition will continue to be accessible on the non-preferred system. HA will automatically re-establish when the preferred system is available. | Replication will continue between the active management system and the DR system. |
The non-preferred HA system is unavailable (for example, power failure, hardware failure) | The storage partition will continue to be accessible on the preferred system. HA will automatically re-establish when the non-preferred system is available. | Replication will continue between the active management system and the DR system. |
The DR system is unavailable (for example, power failure, hardware failure) | No impact | Replication is suspended. Replication will resume automatically when the DR system is available. |
Loss of management IP connectivity only between the HA systems | No impact on host I/O. If high availability is established for the partition, configuration changes that would result in the loss of high availability are blocked. If high availability is not established, configuration changes are permitted and HA will re-establish automatically when connectivity is restored. | No impact. |
Loss of management IP connectivity only between the active management system and the DR system | No impact. | Configuration changes made to the HA storage partition cannot be replicated to the DR system. Replication will be suspended if configuration changes are required on the DR system. If replication is suspended, it will resume automatically when connectivity is restored. |
Access is enabled to the recovery copy of one or more volume groups in the storage partition | No impact. | Replication is suspended for the independent volume groups. Host I/O and configuration changes are allowed on the HA and DR partitions independently. |
Replication is restarted to synchronize changes from the DR system to the HA systems | HA is suspended while replication is running from the DR system to the active management system. HA will re-establish automatically when there are no volume groups replicating from the DR system. | No impact. |
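
The disaster recovery column of the table can be summarized as a simple decision rule for which HA system replicates to the DR system. The sketch below is a simplified model for reasoning about the scenarios, not the product's internal logic: the class and field names are hypothetical, it assumes the preferred system is the active management system in the healthy state, and it assumes the non-preferred system takes over that role when the preferred system is unavailable. It does not model quorum, management IP connectivity, or the link between the HA systems.

```python
from dataclasses import dataclass


@dataclass
class SiteState:
    """Hypothetical snapshot of a 3-site configuration (all healthy by default)."""
    preferred_up: bool = True
    non_preferred_up: bool = True
    dr_up: bool = True
    preferred_to_dr: bool = True      # link: preferred HA system <-> DR system
    non_preferred_to_dr: bool = True  # link: non-preferred HA system <-> DR system


def replication_source(state):
    """Return which HA system replicates to DR, or None if replication is suspended."""
    if not state.dr_up:
        return None  # DR system unavailable: replication is suspended
    if state.preferred_up and state.preferred_to_dr:
        return "preferred"  # healthy case, or only the non-preferred link is lost
    if state.non_preferred_up and state.non_preferred_to_dr:
        return "non-preferred"  # active management system moves to keep replicating
    return None  # neither HA system can reach the DR system


if __name__ == "__main__":
    print(replication_source(SiteState()))
    print(replication_source(SiteState(preferred_to_dr=False)))
    print(replication_source(SiteState(preferred_to_dr=False, non_preferred_to_dr=False)))
```

In this model, replication to the DR system is suspended only when the DR system itself is unavailable or when no available HA system can reach it, which matches the rows in the table where the disaster recovery column reports a suspension due to connectivity or system loss.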