Failover rehearsal of the disaster recovery operation

The KSYS subsystem can perform a failover rehearsal at the backup site in the disaster recovery environment, without disrupting the production workloads or the storage replication from the active site to the backup site.

The failover rehearsal feature is supported for the following storage subsystems:
  • EMC SRDF family of storage systems (for example, VMAX)
  • SAN Volume Controller (SVC) and Storwize® family of storage systems
  • DS8000® storage systems
  • Hitachi storage systems
  • IBM XIV Storage System
  • EMC Unity Storage System
Restriction:
  • The DR failover rehearsal feature is supported only for disaster recovery and not for high availability.
  • The DR failover rehearsal feature is not supported for heterogeneous storage systems or for the shared storage model of deployment.

The failover rehearsal feature lets you rehearse the disaster recovery operation without performing a real DR failover and test the readiness of the entire environment. It gives you the flexibility to perform DR testing more frequently. Failover rehearsal operations also allow you to run various workload-related tests, including tests that perform write operations in the virtual machines (VMs), for a longer period of time.

You can test the disaster recovery operation at host group level or at the entire site level. In a test disaster recovery operation, the virtual machines in the active site continue to run the existing workloads and are not shut down. The storage replication between the storage devices in the active site and the backup site is also not impacted by the test disaster recovery operation.

Because the VMs continue to run in the active site while duplicate test VMs are started on the backup site as part of the failover rehearsal operation, you must ensure network isolation between the active site VMs and the test VMs that are started on the backup site. To isolate the network on the backup site, you can change the VLANs of the test VMs by using the network attribute of the ksysmgr modify command, or you can use another method to achieve network isolation between the sites.

The following figure shows a high-level flow diagram for a DR failover rehearsal as compared to a regular DR failover operation.

Figure 1. DR failover rehearsal flow
Notes:
  • You can perform DR failover rehearsal for a single host group, multiple host groups, a single workgroup, all workgroups in a host group, or the entire site. If you are running the failover rehearsal operation at the host group level, you must start the operation sequentially for each host group. Similarly, if you are running the failover rehearsal operation at the workgroup level, you must start the operation sequentially for each workgroup in a host group. If you are running the failover rehearsal operation at the site level, all the host groups are handled in parallel.
  • After the DR failover rehearsal starts for a host group or a workgroup, you cannot perform regular move operations on the same host group or the same workgroup, or on the entire site, for the entire duration of the DR failover rehearsal.
  • Do not perform Live Partition Mobility (LPM) operations that move virtual machines into or out of the host group or the workgroup that is under the test operation for the entire duration of the DR failover rehearsal. When an LPM operation is started directly through the HMC, the KSYS subsystem cannot stop it. Therefore, if a host group or a workgroup is under DR failover rehearsal and a VM is moved into or out of the host group or workgroup as part of the LPM activity, the results are unpredictable.
  • If a DR failover rehearsal operation is in progress, you cannot change the configuration settings at the KSYS level for any host groups or workgroups, including those host groups or workgroups on which DR failover rehearsal operation is not being performed. The ksysmgr commands that are used to modify, delete, manage, or unmanage host groups and workgroups are blocked until all steps of the DR failover rehearsal operation are completed.
  • If a DR failover rehearsal move operation is in progress, do not change the configuration settings at the HMC level for any host groups or workgroups, including those host groups or workgroups on which the DR failover rehearsal move operation is not being performed. All discovery, verify, move, and failover rehearsal operations are blocked until the DR failover rehearsal move operation is completed. After it completes, these operations are unblocked and you can run them again for host groups and workgroups.
  • For periods of time that are marked as atomic operation in the figure, you cannot start a DR failover rehearsal operation or a regular move operation for any other host group, workgroup, or site.
  • You can perform a regular move operation for host groups and workgroups other than those that are in DR failover rehearsal mode. For example, you can recover host group HG1 while simultaneously testing host group HG2, and you can recover workgroup WG1 while simultaneously testing workgroup WG2.
  • If a real disaster occurs during a planned DR test time and you want to perform a real recovery related to a disaster, you can quit the DR failover rehearsal mode by executing the cleanup step.
  • The cleanup step in a regular DR move operation differs from the cleanup step in a DR test move operation. If a failover rehearsal operation is performed, you must perform the cleanup steps manually.
  • For a regular unplanned DR move operation, the cleanup step is mandatory. The cleanup step must be performed at the earlier-active site after the move operation is complete.
  • When the test-cleanup operation is in progress, you cannot perform any other operations on the KSYS subsystem. You also cannot work on the test VMs because they are deleted as part of the cleanup process.
  • When you perform the rehearsal cleanup operation, the VM console that is opened in the HMC must be closed.
  • If you want to save a snapshot of the current configuration settings of your KSYS environment, you must save a detailed snapshot so that the tertiary disk values are also retained in the snapshot. If you want to know the tertiary disk information after restoring a basic snapshot, you must run the discovery operation with the dr_test flag.
  • The error messages for any failover rehearsal operations are displayed only in the output message of the operation. The error messages are not displayed in the output of the ksysmgr query system status command.
  • The command ksysmgr -t verify site site_name dr_test=yes verifies the disk mapping status at the storage level. This command does not start the regular verify operation.
  • During the failover rehearsal operation, the quick-discovery feature is blocked until the failover rehearsal operation is completed. Also, the event notification is blocked.
  • The storage disks for a managed VM must be configured such that the disks can be queried from at least two KSYS-managed Virtual I/O Servers (VIOS). If any one VIOS goes down, the KSYS subsystem can still get the complete storage disk details from the other VIOS.

    Consider a scenario in which the boot disk is assigned through VIOS1 and the data disk is assigned through VIOS2 in a managed virtual machine VM1. If VIOS2 goes down while the discovery operation is being run at the KSYS subsystem, the KSYS subsystem can find only one disk for VM1, from VIOS1. Although the KSYS subsystem indicates that VIOS2 is down and a GETLSI (get disk details from VIOS2) failure warning event occurs, the KSYS subsystem can perform the disk pair and disk group activities only for the boot disk.

  • If you are using an XSD that is earlier than version 8.0, the following message is displayed during the discovery operation:
    Redundancy path enable require VIOS 3.1.3.XX (XSD Version 8), or later
  • After an unplanned move operation in the HADRHA configuration, if a failure occurs during the sync DB (PrDiscovery) process of the discovery operation, the ksysmgr command-line interface does not display the progress details of high availability operations, although the KSYS subsystem performs the required high availability operation.
  • A managed virtual machine that is in an inactive site does not move during the site or host group move operation. The VM_NO_TARGET_HOST event is logged in the KSYS subsystem for the virtual machine.
  • The disaster recovery (DR) failover rehearsal operation from the backup site to the home site is not supported if the operation involves an unmanaged disk. The DR failover rehearsal operation is supported only from the home site to the backup site with managed disks.
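
The VIOS redundancy note in the list above (each managed VM disk must be queryable from at least two KSYS-managed VIOS) can be illustrated with a minimal sketch. It assumes you have already collected, by whatever means, a plain list of "VIOS name, disk identifier" pairs. The function name, the input format, and the disk names are hypothetical and are not part of the KSYS subsystem or the ksysmgr command.

```shell
#!/bin/sh
# Hypothetical check (not a KSYS command): reads "<vios_name> <disk_id>" lines
# on stdin and prints every disk that fewer than two distinct VIOS can report.
disks_without_vios_redundancy() {
    sort -u |                                  # drop repeated VIOS/disk lines
    awk 'NF == 2 { seen[$2]++ }                # count distinct VIOS per disk
         END { for (d in seen) if (seen[d] < 2) print d }' |
    sort                                       # stable, readable output order
}

# Scenario from the note: boot disk only through VIOS1, data disk only through VIOS2.
printf '%s\n' 'VIOS1 vm1_boot' 'VIOS2 vm1_data' | disks_without_vios_redundancy
```

In this scenario both disks are listed, because each one is visible through a single VIOS only. An empty result would indicate that every disk in the input is reachable through at least two VIOS.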

The following figure shows an example of failover rehearsal of all the virtual machines in a host:

Figure 2. Example for failover rehearsal of the disaster recovery operation

Storage prerequisite

The storage administrator must have mapped all the hosts in the backup site to the backup storage disks (D1, D2, and so on). The storage administrator must also have created a set of clone disks (C1, C2, and so on) that match the active site storage disks (P1, P2, and so on) and the backup storage disks (D1, D2, and so on) in number and size. The cloning (D1-C1, D2-C2, and so on) must be started from the backup storage disks to the clone disks. The storage administrator can set up the cloning relationship by using the interfaces (command-line or graphical user interface) that are provided by the specific storage vendor. Refer to the storage vendor's documentation for more details about allocating storage disks and establishing the relationship with the secondary copy of data on the backup site. The following table lists the tools that are necessary for establishing a cloning relationship for various storage systems.

For IBM SAN Volume Controller (SVC) storage system, ensure that the FlashCopy clone copies are not created with the auto-delete option. The VM Recovery Manager DR solution does not support the auto-delete option for IBM SAN Volume Controller (SVC) storage system.

Table 1. Storage vendors and the corresponding cloning feature

Storage vendor: EMC SRDF family of storage system (for example, VMAX)
  Clone feature: symclone
  Sample command:
    symclone -sid <sid> -f clone create -diff -nop -force
    symclone -sid <sid> -f clone activate -nop -force
  Reference: EMC Solutions Enabler CLI User Guide

Storage vendor: SAN Volume Controller (SVC) and Storwize family of storage systems
  Clone feature: FlashCopy
  Sample command:
    svctask mkfcmap -cleanrate 0 -copyrate 0 -source <D1> -target <C1>
  Reference: Managing Copy Services in SVC

Storage vendor: DS8000 storage system
  Clone feature: FlashCopy
  Sample command:
    dscli -user <user_name> -passwd <password> -hmc1 <ds8k_ip> mkflash -persist -dev <serial_number> <D1>:<C1>
  Reference: Redbook: IBM® System Storage DS8000 Copy Services Scope Management and Resource Groups

Storage vendor: IBM XIV Storage System
  Clone feature: Snapshot
  Sample command:
    For sync type of replication:
    xcli -u <user_name> -p <password> -m <XIV_IP> snapshot_create vol=<D1> name=<flash_disk>
    For async type of replication, the KSYS subsystem creates the clone disks automatically.

Storage vendor: Hitachi Storage Systems
  Clone feature: Shadow image
  Sample command:
    1. Create the Hitachi Open Remote Copy Manager (HORCM) configuration file with the DR instance in the KSYS node. Ensure that the DR instance configured in the HORCM is in active state.
    2. In the GUI of the target storage, create a dummy host on the port which has the target host.
       Note: Do not assign worldwide port name (WWPN) in this stage.
    3. Create a Logical Device (LDEV) with the size equal to the size of the target logical disk. The LDEV will be used as the shadow image of the target logical device.
    4. Map the path of the logical unit number (LUN) to the dummy host you created in the earlier steps.
    5. To create a clone, navigate to Replication > Local replication > Create SI.
    6. Once the shadow image is created, resync the shadow image. To resync the pair, navigate to Replication > Local replication > Select the replication > Resync Pair.

Storage vendor: EMC Unity
  Clone feature: No need to create a clone disk. It will be created automatically during the DR rehearsal discovery operation.
  Sample command: Not required
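
When many disk pairs (D1-C1, D2-C2, and so on) are involved, it can help to review the cloning commands before anything is run on the storage system. The following is a minimal dry-run sketch for the SVC and Storwize case. It only prints the mkfcmap sample command from the table for each pair and does not contact any storage system. The function name and the D1:C1 pair notation are hypothetical, and the disk names must be replaced with the names used in your environment.

```shell
#!/bin/sh
# Hypothetical helper (not part of KSYS or the SVC CLI): prints, but does not
# run, one FlashCopy mapping command per backup-disk:clone-disk pair.
print_fcmap_commands() {
    for pair in "$@"; do
        case "$pair" in
            ?*:?*) ;;                                  # expected form, e.g. D1:C1
            *) echo "bad pair: $pair" >&2; return 1 ;;
        esac
        src=${pair%%:*}   # text before the first colon (backup disk)
        tgt=${pair#*:}    # text after the first colon (clone disk)
        echo "svctask mkfcmap -cleanrate 0 -copyrate 0 -source $src -target $tgt"
    done
}

# Example: print the mappings for two backup/clone pairs.
print_fcmap_commands D1:C1 D2:C2
```

Printing the commands first keeps the one-to-one pairing between backup disks and clone disks easy to check, and the storage administrator can then run the reviewed commands through the vendor's own interface.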