Failover rehearsal of the disaster recovery operation
The KSYS subsystem can perform a failover rehearsal at the backup site in the disaster recovery environment, without disrupting the production workloads or the storage replication from the active site to the backup site.
The failover rehearsal feature is supported for the following storage systems:
- EMC SRDF family of storage systems (for example, VMAX)
- SAN Volume Controller (SVC) and Storwize® family of storage systems
- DS8000® storage systems
- Hitachi storage systems
- IBM XIV Storage System
- EMC Unity Storage System
- The DR failover rehearsal feature is supported only for disaster recovery and not high availability.
- The DR failover rehearsal feature is not supported for heterogeneous storage systems and shared storage model of deployment.
The failover rehearsal feature lets you rehearse the disaster recovery operation without performing a real DR failover, and test the readiness of the entire environment. It gives you the flexibility to perform DR testing more frequently. Failover rehearsal operations also allow you to run various workload-related tests, including write operations in the virtual machines (VMs), for a longer period of time.
You can test the disaster recovery operation at host group level or at the entire site level. In a test disaster recovery operation, the virtual machines in the active site continue to run the existing workloads and are not shut down. The storage replication between the storage devices in the active site and the backup site is also not impacted by the test disaster recovery operation.
Because the VMs continue to run in the active site while duplicate test VMs are started on the backup site as part of the failover rehearsal operation, you must ensure network isolation between the active site VMs and the test VMs. To isolate the network on the backup site, you can change the VLANs of the test VMs by using the network attribute of the ksysmgr modify command, or you can use another method to achieve network isolation between the sites.
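As a rough illustration of how a rehearsal run might be scripted, the following Python sketch only builds command strings for a dry run and does not execute anything. The verify form follows the ksysmgr verify command with dr_test=yes that appears in this topic. The discover, move, and cleanup forms are assumed to follow the same pattern and might require additional arguments, so check them against the ksysmgr command reference before use. The names RehearsalTarget, rehearsal_commands, and HG1 are hypothetical.

```python
from dataclasses import dataclass

# Scopes named in this topic. Their use as CLI keywords is an assumption.
SCOPES = {"site", "host_group", "workgroup"}

# Assumed order of a rehearsal run: discover, verify, start test VMs, clean up.
STEPS = ("discover", "verify", "move", "cleanup")


@dataclass(frozen=True)
class RehearsalTarget:
    scope: str  # "site", "host_group", or "workgroup"
    name: str   # hypothetical entity name, for example "HG1"


def rehearsal_commands(target: RehearsalTarget) -> list[str]:
    """Return the rehearsal steps as command strings (dry run only)."""
    if target.scope not in SCOPES:
        raise ValueError(f"unsupported scope: {target.scope!r}")
    # Every step carries dr_test=yes so that it is treated as a rehearsal
    # step and not as a regular operation (argument layout is assumed).
    return [
        f"ksysmgr {step} {target.scope} {target.name} dr_test=yes"
        for step in STEPS
    ]


if __name__ == "__main__":
    for command in rehearsal_commands(RehearsalTarget("host_group", "HG1")):
        print(command)
```

Printing the commands first, instead of running them, makes it easy to review the sequence for each host group or workgroup before anything touches the KSYS node.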
The following figure shows a high-level flow diagram for a DR failover rehearsal as compared to a regular DR failover operation.
- You can perform DR failover rehearsal for a single host group, multiple host groups, a single workgroup, all workgroups in a host group, or the entire site. If you are running the failover rehearsal operation at host group level, you must start the operation sequentially for each host group. Similarly, if you are running the failover rehearsal operation at workgroup level, you must start the operation sequentially for each workgroup in a host group. If you are running the failover rehearsal operation at site level, all the host groups are handled in parallel.
- After the DR failover rehearsal starts for a host group or a workgroup, you cannot perform regular move operations on the same host group or the same workgroup, or on the entire site for the entire duration of DR failover rehearsal time.
- You must not perform Live Partition Mobility (LPM) operations that move virtual machines into or out of the host group or the workgroup that is under the test operation for the entire duration of the DR failover rehearsal. When the LPM operation is started directly through the HMC, the KSYS subsystem cannot stop those operations. Therefore, if a host group or a workgroup is under DR failover rehearsal and a VM is moved into or out of the host group or workgroup as part of the LPM activity, the results are unpredictable.
- If a DR failover rehearsal operation is in progress, you cannot change the configuration settings at the KSYS level for any host groups or workgroups, including those host groups or workgroups on which DR failover rehearsal operation is not being performed. The ksysmgr commands that are used to modify, delete, manage, or unmanage host groups and workgroups are blocked until all steps of the DR failover rehearsal operation are completed.
- If a DR failover rehearsal move operation is in progress, do not change the configuration settings at the HMC level for any host groups or workgroups, including those host groups or workgroups on which the DR failover rehearsal move operation is not being performed. All discovery, verify, move, and failover rehearsal operations are blocked until the DR failover rehearsal move operation is completed. After it completes, these operations are unblocked and you can run them again for host groups and workgroups.
- For periods of time that are marked as atomic operation in the figure, you cannot start DR failover rehearsal operation or a regular move operation for any other host group, workgroup, or site.
- You can perform a regular move operation for host groups and workgroups that are not in DR failover rehearsal mode. For example, you can recover host group HG1 while simultaneously testing host group HG2, and you can recover workgroup WG1 while simultaneously testing workgroup WG2.
- If a real disaster occurs during a planned DR test time and you want to perform a real recovery related to a disaster, you can quit the DR failover rehearsal mode by executing the cleanup step.
- The cleanup step of a regular DR move operation differs from the cleanup step of a DR test move operation. If a failover rehearsal operation is performed, you must manually perform the cleanup steps.
- For a regular unplanned DR move operation, the cleanup step is mandatory. The cleanup step must be performed after the move operation is complete at the earlier-active site.
- When the test-cleanup operation is in progress, you cannot perform any other operations on the KSYS subsystem. You also cannot work on the test VMs because the test VMs are deleted as part of the cleanup process.
- Before you perform the rehearsal cleanup operation, close any VM console that is open in the HMC.
- If you want to save a snapshot of the current configuration settings of your KSYS environment, you must save a detailed snapshot so that the tertiary disk values are also retained in the snapshot. If you want to know the tertiary disk information after restoring a basic snapshot, you must run the discovery operation with the dr_test flag.
- The error messages for any failover rehearsal operations are displayed only in the output message of the operation. The error messages are not displayed in the output of the ksysmgr query system status command.
- The command ksysmgr -t verify site site_name dr_test=yes verifies the disk mapping status at the storage level. This command does not start the regular verify operation.
- During the failover rehearsal operation, the quick-discovery feature is blocked until the failover rehearsal operation is completed. Also, the event notification is blocked.
- The storage disks for a managed VM must be configured such that the disks can be queried from at least two KSYS-managed Virtual I/O Servers (VIOS). If any one VIOS goes down, the KSYS subsystem can still get the complete storage disk details from the other VIOS.
Consider a scenario in which the boot disk is assigned through VIOS1 and the data disk is assigned through VIOS2 in a managed virtual machine VM1. If VIOS2 goes down while the discovery operation is being run at the KSYS subsystem, the KSYS subsystem can find only one disk for VM1, from VIOS1. Even though the KSYS subsystem indicates that VIOS2 is down and a GETLSI (get disk details from VIOS2) failure warning event occurs, the KSYS subsystem can perform the disk pair and disk group activities only for the boot disk.
- If you are using an XSD that is earlier than version 8.0, the following message is displayed during the discovery operation:
Redundancy path enable require VIOS 3.1.3.XX (XSD Version 8), or later
- After an unplanned move operation in the HADRHA configuration, if a failure occurs during the sync DB (PrDiscovery) process of the discovery operation, the ksysmgr command-line interface does not display the progress details of high availability operations, although the KSYS subsystem performs the required high availability operation.
- A managed virtual machine that is in an inactive site does not move during the site or host group move operation. The VM_NO_TARGET_HOST event is logged in the KSYS subsystem for the virtual machine.
- The disaster recovery (DR) failover rehearsal operation from the backup site to the home site is not supported if the operation involves unmanaged disks. The DR failover rehearsal operation is supported only from the home site to the backup site with managed disks.
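The storage redundancy requirement in the list above (each disk of a managed VM must be visible through at least two KSYS-managed VIOS) can be checked ahead of a rehearsal with a small script. The following Python sketch is a minimal, hypothetical check. The (VM, disk, VIOS) tuples are assumed to come from your own inventory data and not from any KSYS interface, and the function name disks_missing_redundancy is illustrative.

```python
from collections import defaultdict


def disks_missing_redundancy(disk_paths, min_vios=2):
    """Return {vm: [disks]} for disks visible through fewer than min_vios VIOS.

    disk_paths is an iterable of (vm, disk, vios) tuples, one tuple for
    each VIOS through which a disk of a managed VM can be queried.
    """
    vios_per_disk = defaultdict(set)
    for vm, disk, vios in disk_paths:
        vios_per_disk[(vm, disk)].add(vios)

    at_risk = defaultdict(list)
    for (vm, disk), vios_set in sorted(vios_per_disk.items()):
        if len(vios_set) < min_vios:
            at_risk[vm].append(disk)
    return dict(at_risk)


# The scenario described above: the boot disk is assigned only through
# VIOS1 and the data disk only through VIOS2.
single_path = [
    ("VM1", "boot", "VIOS1"),
    ("VM1", "data", "VIOS2"),
]
print(disks_missing_redundancy(single_path))  # {'VM1': ['boot', 'data']}

# The same disks, each visible through both VIOS, pass the check.
dual_path = single_path + [
    ("VM1", "boot", "VIOS2"),
    ("VM1", "data", "VIOS1"),
]
print(disks_missing_redundancy(dual_path))  # {}
```

In the single-path case, losing VIOS2 leaves only the boot disk discoverable, which is the partial-discovery situation that the scenario describes.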
The following figure shows an example of failover rehearsal of all the virtual machines in a host:
Storage prerequisite
The storage administrator must have mapped all the hosts in the backup site to the backup storage disks (D1, D2, and so on). The storage administrator must also have created a set of clone disks (C1, C2, and so on) that are of the same number and size as the active site storage disks (P1, P2, and so on) and backup storage disks (D1, D2, and so on). The cloning (D1-C1, D2-C2, and so on) must be started from the backup storage disks to the clone disks. The storage administrator can set up the cloning relationship by using the interfaces (command-line or graphical user interface) that are provided by the specific storage vendor. Refer to the documentation from the storage vendor for more details about allocating storage disks and establishing a relationship with the secondary copy of data on the backup site. The following table lists the tools that are necessary for establishing a cloning relationship for various storage systems.
For IBM SAN Volume Controller (SVC) storage system, ensure that the FlashCopy clone copies are not created with the auto-delete option. The VM Recovery Manager DR solution does not support the auto-delete option for IBM SAN Volume Controller (SVC) storage system.
Storage vendor | Clone feature | Sample command | Reference |
---|---|---|---|
EMC SRDF family of storage systems (for example, VMAX) | symclone | | EMC Solutions Enabler CLI User Guide |
SAN Volume Controller (SVC) and Storwize family of storage systems | FlashCopy | | Managing Copy Services in SVC |
DS8000 storage system | FlashCopy | | Redbook: IBM® System Storage DS8000 Copy Services Scope Management and Resource Groups |
IBM XIV Storage System | Snapshot | For sync type of replication:<br>For async type of replication, the KSYS subsystem creates the clone disks automatically. | |
Hitachi Storage Systems | ShadowImage | | |
EMC Unity | No need to create a clone disk. It is created automatically during the DR rehearsal discovery operation. | Not required | |
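The storage prerequisite above requires clone disks that match the backup storage disks in number and size, paired as D1-C1, D2-C2, and so on. As a minimal sketch, the following Python function validates a planned layout before the cloning relationships are created with the vendor tools. The disk names, the sizes in GB, and the function name pair_clone_disks are hypothetical, and the sketch does not call any storage vendor API.

```python
def pair_clone_disks(backup_disks, clone_disks):
    """Pair backup disks with clone disks in order and validate the plan.

    Both arguments are lists of (name, size_gb) tuples. Raises ValueError
    if the number of disks differs or if a pair differs in size.
    """
    if len(backup_disks) != len(clone_disks):
        raise ValueError(
            f"{len(backup_disks)} backup disks but {len(clone_disks)} clone disks"
        )
    pairs = []
    for (b_name, b_size), (c_name, c_size) in zip(backup_disks, clone_disks):
        if b_size != c_size:
            raise ValueError(
                f"size mismatch: {b_name} is {b_size} GB, {c_name} is {c_size} GB"
            )
        pairs.append((b_name, c_name))
    return pairs


backup = [("D1", 100), ("D2", 250)]
clones = [("C1", 100), ("C2", 250)]
print(pair_clone_disks(backup, clones))  # [('D1', 'C1'), ('D2', 'C2')]
```

A mismatch in count or size raises an error early, before a rehearsal discovery has a chance to fail on an incomplete clone setup.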