Failover rehearsal of the disaster recovery operation

The KSYS subsystem can perform a failover rehearsal at the backup site in the disaster recovery environment, without disrupting the production workloads or the storage replication from the active site to the backup site.

The failover rehearsal feature is supported for the following storage subsystems:

EMC SRDF family of storage systems (for example, VMAX)
SAN Volume Controller (SVC) and Storwize® family of storage systems
DS8000® storage system

Notes:

The DR failover rehearsal feature is supported only for disaster recovery and not for high availability.
The DR failover rehearsal feature is not supported for heterogeneous storage systems or for the shared storage model of deployment.
The DR failover rehearsal feature is not supported for Hitachi storage systems.

The failover rehearsal feature lets you rehearse the disaster recovery operation without performing a real DR failover, and test the readiness of the entire environment. It gives you the flexibility to perform DR testing more frequently. During a failover rehearsal, you can run various workload-related tests, including write operations in the virtual machines (VMs), for an extended period of time.

You can test the disaster recovery operation at host group level or at the entire site level. In a test disaster recovery operation, the virtual machines in the active site continue to run the existing workloads and are not shut down. The storage replication between the storage devices in the active site and the backup site is also not impacted by the test disaster recovery operation.

Because the VMs continue to run in the active site while duplicate test VMs are started on the backup site as part of the failover rehearsal operation, you must ensure network isolation between the active site VMs and the test VMs. To isolate the network on the backup site, you can change the VLANs of the test VMs by using the network attribute of the ksysmgr modify command, or you can use another method to achieve network isolation between the sites.
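The network isolation requirement above can be checked before a rehearsal with a small script. The following Python sketch is illustrative only and is not part of the KSYS tooling: the function names and the VLAN mapping layout (production VLAN ID mapped to the VLAN ID that the test VMs use) are assumptions made for this example.

```python
def find_vlan_conflicts(active_vlans, test_vlan_map):
    """Return the sorted test VLAN IDs that active-site VMs also use.

    active_vlans: set of VLAN IDs used by VMs on the active site.
    test_vlan_map: dict that maps each production VLAN ID to the VLAN ID
                   planned for the test VMs (hypothetical layout).
    """
    return sorted(set(test_vlan_map.values()) & set(active_vlans))


def unmapped_vlans(active_vlans, test_vlan_map):
    """Return the sorted production VLAN IDs that have no test VLAN mapping."""
    return sorted(set(active_vlans) - set(test_vlan_map))


# Example: VLAN 13 is mapped to VLAN 12, which production VMs still use.
active = {1, 12, 13}
mapping = {1: 101, 12: 112, 13: 12}
print(find_vlan_conflicts(active, mapping))  # [12]
print(unmapped_vlans(active, mapping))       # []
```

An empty result from both functions suggests that the planned test VLANs do not overlap with the production VLANs. It does not prove isolation at the switch or router level.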

The following figure shows a high-level flow diagram for a DR failover rehearsal as compared to a regular DR failover operation.

Figure 1. DR failover rehearsal flow

Notes:

You can perform DR failover rehearsal for a single host group, multiple host groups, or the entire site. If you are running the failover rehearsal operation at host group level, you must start the operation sequentially for each host group. If you are running the failover rehearsal operation at site level, all the host groups are handled in parallel.
After the DR failover rehearsal starts for a host group, you cannot perform regular move operations on the same host group or on the entire site for the entire duration of DR failover rehearsal time.
You must not perform Live Partition Mobility (LPM) operations that move virtual machines into or out of a host group that is under the test operation for the entire duration of the DR failover rehearsal. When an LPM operation is started directly through the HMC, the KSYS subsystem cannot stop it. Therefore, if a host group is under DR failover rehearsal and a VM is moved into or out of the host group as part of the LPM activity, the results are unpredictable.
You cannot change the configuration settings for host groups that are under DR failover rehearsal operation.
During the periods that are marked as atomic operations in the figure, you cannot start a DR failover rehearsal operation or a regular move operation for any other host group or site.
You can perform regular move operations for host groups other than the ones that are in DR failover rehearsal mode. For example, you can recover host group HG1 while simultaneously testing host group HG2.
If a real disaster occurs during a planned DR test time and you want to perform a real recovery related to a disaster, you can quit the DR failover rehearsal mode by executing the cleanup step.
The cleanup step in a regular DR move operation is different from the cleanup step in a DR test move operation. After a failover rehearsal operation, you must manually perform the cleanup steps.
For a regular unplanned DR move operation, the cleanup step is mandatory. The cleanup step must be performed at the previously active site after the move operation is complete.
When the test-cleanup operation is in progress, you cannot perform any other operations on the KSYS subsystem. You also cannot work on the test VMs, because they are deleted as part of the cleanup process.
When you perform the rehearsal cleanup operation, any VM console that is open in the HMC must be closed.
If you want to save a snapshot of the current configuration settings of your KSYS environment, you must save a detailed snapshot so that the tertiary disk values are also retained in the snapshot. If you want to get the tertiary disk information after restoring a basic snapshot, you must run the discovery operation with the dr_test flag.
The error messages for any failover rehearsal operations are displayed only in the output message of the operation. The error messages are not displayed in the output of the ksysmgr query system status command.
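The restrictions in the preceding notes can be summarized as a small state model. The following Python sketch is a toy illustration, not the KSYS implementation: the class and method names are hypothetical, and it does not model the atomic operation windows or LPM activity.

```python
from dataclasses import dataclass, field


@dataclass
class SiteState:
    """Toy model of a site's host groups and their rehearsal state."""
    host_groups: set
    in_rehearsal: set = field(default_factory=set)

    def start_rehearsal(self, host_group):
        if host_group not in self.host_groups:
            raise ValueError(f"unknown host group: {host_group}")
        self.in_rehearsal.add(host_group)

    def cleanup_rehearsal(self, host_group):
        # The cleanup step ends rehearsal mode for the host group.
        self.in_rehearsal.discard(host_group)

    def can_move_host_group(self, host_group):
        # A regular move is blocked only for host groups under rehearsal.
        return (host_group in self.host_groups
                and host_group not in self.in_rehearsal)

    def can_move_site(self):
        # A site-level move is blocked while any host group is under rehearsal.
        return not self.in_rehearsal

    def can_modify_config(self, host_group):
        # Configuration changes are blocked for host groups under rehearsal.
        return (host_group in self.host_groups
                and host_group not in self.in_rehearsal)


state = SiteState(host_groups={"HG1", "HG2"})
state.start_rehearsal("HG2")
print(state.can_move_host_group("HG1"))  # True
print(state.can_move_host_group("HG2"))  # False
print(state.can_move_site())             # False
```

In this model, running the cleanup for HG2 makes a regular move possible again for HG2 and for the site, which mirrors the note about quitting rehearsal mode when a real disaster occurs.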

The following figure shows an example of failover rehearsal of all the virtual machines in a host:

Figure 2. Example for failover rehearsal of the disaster recovery operation

Storage prerequisite

Before you run a failover rehearsal, the storage administrator must map all the hosts in the backup site to the backup storage disks (D1, D2, and so on). The storage administrator must also create a set of clone disks (C1, C2, and so on) that match the active site storage disks (P1, P2, and so on) and the backup storage disks (D1, D2, and so on) in number and size. Cloning must be started from the backup storage disks to the clone disks (D1-C1, D2-C2, and so on). The storage administrator can set up the cloning relationship by using the command-line or graphical user interfaces that the specific storage vendor provides. Refer to the storage vendor documentation for more details about allocating storage disks and establishing the relationship with the secondary copy of data on the backup site. The following table lists the tools that are necessary for establishing a cloning relationship for various storage systems.

Table 1. Storage vendors and the corresponding cloning feature
Storage vendor	Clone feature	Reference
EMC SRDF family of storage systems (for example, VMAX)	`symclone`	EMC Solutions Enabler CLI User Guide
SAN Volume Controller (SVC) and Storwize family of storage systems	`FlashCopy`	Managing Copy Services in SVC
DS8000 storage system	`FlashCopy`	Redbook: IBM® System Storage® DS8000 Copy Services Scope Management and Resource Groups
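The clone disk requirement in the storage prerequisite (same number and size as the backup storage disks) can be sanity-checked before a rehearsal. The following Python sketch is a minimal illustration: the function name and the input layout (disk name mapped to size in GiB) are assumptions for this example, and the disk inventory would have to come from your storage vendor's tools.

```python
from collections import Counter


def check_clone_disks(backup_disks, clone_disks):
    """Compare backup and clone disks and return a list of problems.

    Both arguments map a disk name to its size in GiB, for example
    {"D1": 100, "D2": 200}. An empty list means the sets match.
    """
    problems = []
    if len(backup_disks) != len(clone_disks):
        problems.append(
            f"disk count differs: {len(backup_disks)} backup, "
            f"{len(clone_disks)} clone"
        )
    # Compare sizes as multisets so that disk naming does not matter.
    missing = Counter(backup_disks.values()) - Counter(clone_disks.values())
    for size, count in sorted(missing.items()):
        problems.append(f"{count} clone disk(s) of size {size} GiB missing")
    return problems


backup = {"D1": 100, "D2": 200}
clones = {"C1": 100}
for problem in check_clone_disks(backup, clones):
    print(problem)
```

The check only compares counts and sizes. It does not verify that the D1-C1 and D2-C2 cloning relationships were actually started on the storage system.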