How to safeguard the KSYS node in IBM Geographically Dispersed Resiliency for Power Systems

What is IBM Geographically Dispersed Resiliency for Power Systems?

The IBM® Geographically Dispersed Resiliency for Power Systems™ solution is a disaster recovery solution that is easy to deploy and provides an automated process to recover virtual machines (VMs) at the remote or failover site during a disaster. Because disaster recovery of applications and services is key to business continuity, the solution helps customers automate the recovery process after a failure. Disaster recovery solutions are mainly based on either cluster-based technology or VM restart-based technology. This solution provides an easy deployment model that uses a controller system (called KSYS) to monitor the entire virtual machine environment. It also provides flexible failover policies and storage replication management.

You can learn more about Geographically Dispersed Resiliency (GDR) for Power Systems in the IBM developerWorks® wiki documents "Why GDR is the ideal DR solution for Power Systems" and "FAQ".

Key terminologies used in IBM Geographically Dispersed Resiliency

  • KSYS: KSYS is the logical partition (LPAR), currently an IBM AIX® LPAR, where Geographically Dispersed Resiliency software is deployed. KSYS acts as the orchestrator that monitors, manages and moves VMs from one site to another. KSYS stands for the C(K)ontroller System LPAR. KSYS is configured using the ksysmgr command in the following format: ksysmgr ACTION CLASS [NAME] [ATTRIBUTES...]
  • Site: This is a logical name that represents the primary or active site and the disaster recovery or backup site. Sites must be created at the KSYS level. All the Hardware Management Consoles (HMCs), hosts, Virtual I/O Server (VIOS) and storage devices are mapped to one of the sites. Sites can be of the following types:
    • Active site (or primary site): This refers to the current site where the workloads are running at a specific time.
    • Backup site (or disaster recovery site): This refers to the site that acts as a backup for the workload at a specific time. During a disaster or a potential disaster, workloads are moved to the backup site.
  • Host: A host is a managed system in the HMC that is primarily used to run workloads. Each host is identified by its universally unique identifier (UUID) as tracked in the HMC. A host pair is a set of hosts that are paired across the sites for high availability and disaster recovery.
  • Virtual machines: Virtual machines, also known as logical partitions (LPARs), are associated with specific VIOS partitions that provide virtualized storage to run a workload. A host can contain multiple virtual machines.
  • Storage agents: A disaster recovery solution requires organized storage management because storage is a vital entity in any data center. The GDR solution relies on storage replication to copy data from the active site to the backup site.
  • Discovery of site: After the initial configuration is complete, the KSYS node discovers all the hosts that are managed by the HMCs in both the active and the backup sites and displays the status. During discovery, the KSYS node monitors the discovery of all LPARs or VMs in all the managed hosts within the selected site. The KSYS node collects the configuration information for each LPAR and displays the status. The KSYS node also discovers the disks of each VM and checks whether the VMs are currently configured for storage device mirroring.
  • Verification of site: In the verification phase, the KSYS node fetches information from the HMC to check whether the backup site is capable of hosting the VMs during a disaster. The KSYS node also verifies storage replication-related details.
  • Disaster recovery: After the verification phase, the KSYS node keeps monitoring the active site for any failures or issues in any of the resources in the site. When any planned or unplanned outages occur, and if the situation requires disaster recovery, you must manually initiate the recovery by using the ksysmgr command that moves the virtual machines to the backup site.

Problem statement

A recommended best practice of the KSYS design is that the KSYS node must be on a different site than the one that is running the production workload VMs. Such a design can ensure that the KSYS node is up and running even when a disaster strikes the production site, and can bring up workload on backup sites.

In this article, we discuss how to secure the KSYS node, when using only active and backup sites.

Figure 1. Recommended design

We recommend a simple best practice: run the KSYS node on the site that is not running the production workload. When the production workload is running on the active site, the KSYS node must be running on the backup site, and when the workload moves to the backup site, the KSYS node must be running on the active site. For example, suppose the production workload is running on the active site and the KSYS node is running on the backup site. If a disaster strikes the active site, the KSYS node on the backup site can still initiate an unplanned move of the workload VMs to the backup site. Similarly, when your production workload VMs are running on the backup site, the KSYS node must ideally be running on the active site.
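
The placement rule above reduces to picking the site that is not hosting the workload. The following is a minimal shell sketch of that choice, not part of the GDR tooling; the function name ksys_site_for is hypothetical, and the site names are the ones used in the sample setup in this article.

```shell
#!/bin/sh
# Print the site where the KSYS node should run: the one that is NOT
# hosting the production workload.
# Usage: ksys_site_for <workload_site> <site_a> <site_b>
# (ksys_site_for is a hypothetical helper, not a GDR command.)
ksys_site_for() {
    workload=$1; site_a=$2; site_b=$3
    if [ "$workload" = "$site_a" ]; then
        echo "$site_b"
    elif [ "$workload" = "$site_b" ]; then
        echo "$site_a"
    else
        echo "unknown site: $workload" >&2
        return 1
    fi
}

# Workload on Austin, so the KSYS node belongs on India.
ksys_site_for Austin Austin India
```

Calling the function again after a site move, with the new workload site as the first argument, gives the site to which the KSYS node should follow.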

This recommendation, though logically simple, leaves one important gap: how does the KSYS node move to the opposite site when the production workload VMs move?

Implementation of the recommended design

To ensure that the KSYS node can move from the active to backup site, and similarly from the backup site to the active site, we need to replicate the KSYS node VM disks in the opposite direction.

Let's assume that the production workload VMs are on the active site, and the KSYS node is running on the backup site. Now, let us walk you through the steps needed to achieve the replication.

Refer to the following sample hardware setup to be used:

Site 1 is the active site (in this example, Austin).

Site 2 is the backup site (in this example, India).

Host 1_1 is a host (managed system) on the active site (in this example, doit3-8233-E8B-06DA59R).

Host 2_1 is a host (managed system) on the backup site (in this example, doit4-8233-E8B-06DA5AR).

VM is the virtual machine (or LPAR) that runs the production workload (in this example, demo_vm).

Storage is the storage that provides the disks for the KSYS node. In this example, the KSYS node on the primary site has disks from EMC VMAX 508, and the KSYS node on the backup site has disks from EMC VMAX 573.

KSYS node is the node that runs ksysmgr, the command used to create and handle the configuration. In this example, the LPAR is named KSYS node.

Step 1: Creating a KSYS node

Create a KSYS node on the primary site with a configuration similar to that of the KSYS node on the backup site. Ensure that the KSYS node on the primary site is not in the activated state, whereas the KSYS node on the backup site is up and running.

Figure 2. Setup details in HMC VMs on active site

Step 2: Replicating disks from the backup site to the primary site

Next, mirror or replicate the disks of the KSYS node on the backup site (India) to the primary site (Austin). The mode of replication can be sync or async. We'll use the sync mode of replication for the demo.

The direction of replication is from the backup site to the active site.

Figure 3. Disk replication from backup site to active site

After the pairing succeeds, we can check the state of the disk to see if it is in the synchronized state. Synchronized state means that data in the secondary image is identical to that in the primary image.

Figure 4. Checking the disk pair state

Step 3: Initiating move from the primary site to the backup site

Let's assume that a disaster has struck the active site. We initiate a recovery process from the active site to the backup site. This can be done using the ksysmgr command:

ksysmgr move site from=<active site> to=<backup site>

Figure 5. Moving site from active to backup
Figure 6. Moved VM details in HMC from active site to backup site
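
As an illustration of the move command with this article's site names filled in, here is a small dry-run wrapper. It is only a sketch: the run and move_site helpers and the RUN variable are hypothetical, and only the from= and to= attributes shown above are used. By default the command is printed, not executed.

```shell
#!/bin/sh
# Dry run by default: the command is only printed.
# Set RUN=1 on a real KSYS node to execute it.
# (run, move_site, and RUN are hypothetical names for this sketch.)
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Usage: move_site <from_site> <to_site>
move_site() {
    if [ $# -ne 2 ] || [ "$1" = "$2" ]; then
        echo "usage: move_site <from_site> <to_site> (two different sites)" >&2
        return 1
    fi
    run ksysmgr move site from="$1" to="$2"
}

# Move the workload from the active site to the backup site.
move_site Austin India
```

Printing first makes it easy to review the exact site names before committing to a move, which matters because the recovery is initiated manually.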

Step 4: Unmanaging existing KSYS node on the backup site

The KSYS node on the primary site (Austin) would be discovered and added to the KSYS configuration, so we should unmanage the KSYS node present on the primary site (Austin).

This can be done as follows:

ksysmgr unmanage vm {name=<VM name> host=<hostname> | uuid=<VM uuid>}
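
The syntax above allows two alternative forms: VM name plus host name, or VM UUID alone. The sketch below prints the command for either form without running it. The function name unmanage_cmd and the LPAR name ksys_austin are hypothetical; the host name comes from the sample setup.

```shell
#!/bin/sh
# Print the unmanage command for a VM identified either by name and host
# or by UUID. Nothing is executed; the caller decides whether to run it.
# Usage: unmanage_cmd name <vm_name> <host_name>
#        unmanage_cmd uuid <vm_uuid>
# (unmanage_cmd is a hypothetical helper for this sketch.)
unmanage_cmd() {
    case "$1" in
        name)
            [ $# -eq 3 ] || { echo "name form needs <vm_name> <host_name>" >&2; return 1; }
            echo "ksysmgr unmanage vm name=$2 host=$3"
            ;;
        uuid)
            [ $# -eq 2 ] || { echo "uuid form needs <vm_uuid>" >&2; return 1; }
            echo "ksysmgr unmanage vm uuid=$2"
            ;;
        *)
            echo "first argument must be 'name' or 'uuid'" >&2
            return 1
            ;;
    esac
}

# ksys_austin is a hypothetical LPAR name for the inactive KSYS node.
unmanage_cmd name ksys_austin doit3-8233-E8B-06DA59R
```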

Step 5: Shutting down the KSYS node on the backup site

When the production VMs have moved successfully from the active site to the backup site and the active site has completely recovered from the disaster, shut down the KSYS node on the backup site.

Then reverse the direction of replication so that it runs from the active site (Austin) to the backup site (India). This article illustrates the procedure with Dell EMC storage disks.

  1. Disable the disk pair.
    Figure 7. Disabling disk pair synchronization
  2. Split the disk pair.
    Figure 8. Splitting the disk pair
  3. Swap the disk pair.
    Figure 9. Changing replication path
  4. Establish the disk pair from the active site (Austin) to the backup site (India).
    Figure 10. Establishing disk pair synchronization
  5. Wait and check until the disk pair is in the synchronized state.
    Figure 11. Verifying disk pair status
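
The five steps above can also be sketched from a command line. The following dry run assumes that the KSYS disks form an SRDF device group managed with the Dell EMC SYMCLI symrdf command. This article does not state which tool was used, so the mapping of each step to a symrdf action is an assumption, and the device group name ksys_dg is hypothetical. The commands are only printed.

```shell
#!/bin/sh
# ASSUMPTION: the KSYS node disks are in an SRDF device group that is
# managed with SYMCLI. The group name below is hypothetical.
# This sketch only prints the commands; it never executes them.
DG=${DG:-ksys_dg}

swap_replication_plan() {
    # Steps 1 to 4: disable, split, swap, then establish in the new direction.
    for action in disable split swap establish; do
        echo "symrdf -g $DG -noprompt $action"
    done
    # Step 5: confirm that the pair reports the synchronized state.
    echo "symrdf -g $DG verify -synchronized"
}

swap_replication_plan
```

Review the printed plan against your storage administrator's procedure before running anything, because a swap changes which site holds the source copy of the KSYS disks.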

Step 6: Activating KSYS node

Activate the KSYS node on the primary site (Austin) from the HMC.

Figure 12. Activating KSYS nodes from HMC

Ensure that the IBM.VMR daemon (used by KSYS) is active after the KSYS node migration. The KSYS node has now migrated to the current backup site (Austin) and monitors the VMs running on the current active site (India).

Figure 13. Checking IBM.VMR daemon status
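
The daemon check can be scripted. The sketch below assumes that the status is obtained with the AIX lssrc command (lssrc -s IBM.VMR) and that its output has the usual Subsystem, Group, PID, and Status columns. The function name vmr_is_active is hypothetical, and the sample output is made up for illustration.

```shell
#!/bin/sh
# Succeed only if lssrc-style output on stdin shows IBM.VMR as active.
# (vmr_is_active is a hypothetical helper for this sketch.)
vmr_is_active() {
    awk '$1 == "IBM.VMR" && $NF == "active" { found = 1 } END { exit !found }'
}

# Assumed usage on the KSYS node:  lssrc -s IBM.VMR | vmr_is_active
# Sample output, for illustration only (PID is made up):
sample=' Subsystem         Group            PID          Status
 IBM.VMR          rsct_rm          11534652     active'

if printf '%s\n' "$sample" | vmr_is_active; then
    echo "IBM.VMR is active"
fi
```

Reading the status from standard input keeps the check usable on saved output as well as on a live system.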

After the production VMs are moved to the backup site, the KSYS node on the backup site (India) would be discovered and added to the KSYS configuration. So, we should unmanage the KSYS node present on the backup site (India).

This can be done as follows:

ksysmgr unmanage vm {name=<VM name> host=<hostname> | uuid=<VM uuid>}

Invoke discovery and verification of site from the new KSYS node to validate the currently active site (India) and check whether the backup site (Austin) is able to host the VMs in the event of a disaster.

Figure 14. Discovery on new active site (India)
Figure 15. Verification process from the new KSYS node
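
Discovery and verification can be run back to back from the new KSYS node. The sketch below assumes that both follow the ksysmgr ACTION CLASS [NAME] format described earlier, that is, ksysmgr discover site <site name> and ksysmgr verify site <site name>. Confirm the exact syntax against your ksysmgr documentation. The rediscover helper and the RUN variable are hypothetical, and by default the commands are only printed.

```shell
#!/bin/sh
# Dry run by default: commands are only printed.
# Set RUN=1 on the new KSYS node to execute them.
# ASSUMPTION: discover and verify use "ksysmgr <action> site <name>".
# (rediscover and RUN are hypothetical names for this sketch.)
# Usage: rediscover <site_name>
rediscover() {
    site=$1
    if [ -z "$site" ]; then
        echo "usage: rediscover <site_name>" >&2
        return 1
    fi
    for action in discover verify; do
        if [ "${RUN:-0}" = "1" ]; then
            # Stop at the first failure so verify never runs after a failed discovery.
            ksysmgr "$action" site "$site" || return 1
        else
            echo "ksysmgr $action site $site"
        fi
    done
}

# Validate the currently active site.
rediscover India
```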

Conclusion

This article illustrates how to protect the KSYS node by keeping it on the site opposite the production workload, so that it can always monitor the production VMs and recover them after a disaster.

Published in the developerWorks AIX and UNIX zone on January 31, 2017.