Hosted VMware environments and recovery solutions in IBM Bluemix Local System, Part 3

Building a disaster recovery solution with PureApplication Software

Editor's note: This tutorial applies to V2.2.3.0 of IBM® Bluemix® Local System and PureApplication® products and is no longer being updated. IBM is continuing to improve and add features that support hosted VMware environments and recovery solutions. Please check the Comments section at the bottom of this tutorial for links to updated content.

In Part 2 of this series, you created a separate PureApplication Software workload environment in a Bluemix Local System or PureApplication System. In this part, you use these workload environments in PureApplication Software V2.2.3 to build a disaster recovery solution. Disaster recovery in this context focuses on how to recover an entire workload environment when a single Bluemix Local System experiences an outage in a particular data center. If the solution is configured correctly, you can recover the entire workload environment, including the pattern catalog, pattern instances, virtual machines (VMs), all of the VM disks, and management of these components.

Two Bluemix Local Systems or PureApplication Systems, or a mix of the two, are connected for disaster recovery. The intention is that one system is in one data center, and the second system is in a second data center. The goal is to support a zero data loss solution, if the data centers are close enough to each other, with an estimated recovery time measured in hours. This disaster recovery solution requires manual intervention when a single data center fails. This tutorial focuses on how to set up disaster recovery and how to perform three different disaster recovery procedures.

Overview of disaster recovery procedures

An important feature of the disaster recovery solution in this tutorial is that you can do it on a granular level that is specific to a single PureApplication Software workload environment. That is, two different Bluemix Local Systems that are connected to each other can have multiple PureApplication Software workload environments that can run on either system and can be prepared for failover or recovery on the other system. For example, you can have one PureApplication Software workload environment on System A that replicates to System B for recovery. Then, at the same time, you can run another PureApplication Software workload environment on System B that replicates to System A.

Therefore, for each workload environment, you define the source system and the target system for disaster recovery or replication. The source system is the system where the production workload environment is running. The target system is the system where the workload environment can be recovered upon a failure at the source system. A workload environment is configured between the source system and target system. This configuration occurs by replicating all of the supporting storage volumes (both the block VMFS datastore volumes and block volumes) of a workload environment between the two systems by using Block Storage Replication.
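
The per-environment roles described above can be pictured as a small data model. The following Python sketch is illustrative only: the `ReplicationPlan` class, its field names, and the environment names are hypothetical and are not part of any PureApplication API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPlan:
    """Disaster recovery roles for one workload environment (hypothetical model)."""
    environment: str
    source_system: str  # where the production workload environment runs
    target_system: str  # where the environment is recovered after a failure

    def __post_init__(self) -> None:
        # A workload environment cannot replicate to the system it runs on.
        if self.source_system == self.target_system:
            raise ValueError("source and target must be different systems")


def environments_recovered_on(plans: list[ReplicationPlan], system: str) -> list[str]:
    """Return the environments that would be recovered on the given system."""
    return [p.environment for p in plans if p.target_system == system]


# Two environments that replicate in opposite directions between the same systems.
plans = [
    ReplicationPlan("env-1", source_system="System A", target_system="System B"),
    ReplicationPlan("env-2", source_system="System B", target_system="System A"),
]
print(environments_recovered_on(plans, "System B"))  # prints ['env-1']
```

Because each plan carries its own source and target, System A and System B can each act as the recovery site for the other.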

This tutorial describes three disaster recovery procedures.

Practice failover

The practice failover procedure simulates an unexpected failure on the source system for a workload environment and tests the recovery procedures on a target system. The recovery environment on the target system that is created as part of this procedure is not intended to become the production environment in this procedure. Instead, the recovery environment is set up to verify that recovery can happen and then be cleaned up and removed.

You can do practice disaster recovery in one of the following ways:

  • Recovery with an isolated network. This approach can run production workloads on the source system, even while testing recovery. If you choose to practice disaster recovery with isolated networking, you must ensure that the network on the target system is isolated from the production environment. This way, duplication of host names and IP addresses on the production network does not occur, preventing production workload interruption. Do not disrupt block storage replication between the source and target systems as part of the network isolation.
  • Recovery with a temporary production outage. If you are unable to use recovery with an isolated network by using firewalls or appropriate network barrier techniques, temporarily stop production workloads on the source system.

Planned failover

The planned failover procedure is used to perform a failover of a workload environment from the source system to the target system. It makes the target system the source system and makes the source system the target system for a particular workload environment. For this procedure, you must stop and store all of the production pattern instances and VMs on the source system. You must also switch the direction of storage replication and recover the entire workload environment on the original target system to make it the new source system for the production workload. This procedure occurs with zero data loss and allows for two copies of the data at all times.

Although you might consider using this procedure for practicing disaster recovery, it requires more time than the practice disaster recovery option. In addition, this procedure is different from the actions that you perform in an actual unexpected disaster recovery. Therefore, do not use this procedure to practice disaster recovery.

Unexpected failover

Use the unexpected failover procedure to recover from a failure of the source system, or its data center, that impacts the client's ability to reach the production workload environment. In this case, you can choose to fail over the entire workload environment to the target system. This process looks identical to practice failover in how the recovery is done. The difference is that the recovery environment that is created in this case will become the production environment. You must ensure that the source system is disconnected from the production data networks before it comes back online and connects to the network again. This practice ensures that the original production system on the source system does not interfere with the new production workload environment that is running on the target system.

Planning for disaster recovery

Planning the disaster recovery solution has many essential parts, especially for using the PureApplication Software and Bluemix Local System disaster recovery capabilities in Version 2.2.3. You must plan for the following elements in a disaster recovery solution when you use the PureApplication disaster recovery features. You must plan many of these elements in advance of purchasing systems and choosing locations for them. Therefore, you must define your needs early in the solution planning.

Networking

During the disaster recovery procedure, the pattern-based instances with their corresponding VMs, in addition to the PureApplication Software VM, are started on the target system. Ensure that the host names, IP addresses, and data virtual local area networks (VLANs) that are assigned to these workloads when they are deployed on the source system are the same ones that are used on the target system.

You can configure other data VLANs and subnets on the source system or the target system for use by different workload environments or traditional cloud group deployments that are not configured for disaster recovery. However, you must identically configure each data VLAN used in a workload environment that is configured for disaster recovery, and then set up each VLAN on both systems.
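
A simple way to catch drift between the two systems is to compare the data VLAN definitions that you recorded for each one. The following Python sketch assumes that you collected the VLAN IDs and subnets by hand into dictionaries. The function name and the idea of comparing only the subnet are assumptions for illustration, and "identical" might cover more settings in your environment.

```python
def vlan_mismatches(source: dict[int, str], target: dict[int, str]) -> list[str]:
    """Compare data VLAN definitions (VLAN ID -> subnet in CIDR form).

    `source` should list only the VLANs used by workload environments that are
    configured for disaster recovery. Extra VLANs on the target are allowed.
    """
    problems = []
    for vlan_id, subnet in sorted(source.items()):
        if vlan_id not in target:
            problems.append(f"VLAN {vlan_id} is missing on the target system")
        elif target[vlan_id] != subnet:
            problems.append(
                f"VLAN {vlan_id} uses {target[vlan_id]} on the target, expected {subnet}"
            )
    return problems


# Hypothetical values: VLAN 102 was set up with a different subnet on the target.
source_vlans = {101: "10.1.1.0/24", 102: "10.1.2.0/24"}
target_vlans = {101: "10.1.1.0/24", 102: "10.9.2.0/24", 300: "10.3.0.0/24"}
problems = vlan_mismatches(source_vlans, target_vlans)
for line in problems:
    print(line)
```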

Storage

Planning the amount of storage that is needed for disaster recovery is critical to ensure that you can recover from an unexpected disaster at a source system onto a target system.

First, you must plan for the amount of storage that is needed for the PureApplication Software workload environment on the source system. Part 2 in this series explains how to estimate the amount of storage that is needed for a single environment. You must dedicate the amount of storage for this environment on the source system.

Second, you must plan for the amount of storage that is needed for replication on the target system for each particular workload environment. This storage equals the amount of storage on the source system because each volume, whether block or block VMFS, must have a replica volume of the same size on the target system. You must dedicate the amount of storage for this environment on the target system.

Third, you must plan for the amount of storage that is needed to recover on the target system, either in a practice disaster recovery procedure or a real recovery from an unexpected failure. For both of these procedures, create a clone on the target system for each of the replica storage volumes. You then use the cloned volumes to recover the workload environment. Therefore, reserve twice as much storage capacity on the target system as is used for the source workload environment.
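
The three storage requirements above reduce to simple arithmetic. The following Python sketch shows the calculation for a hypothetical set of volume sizes. The function name and the sizes are made up for illustration.

```python
def target_storage_needed_gb(volume_sizes_gb: list[int]) -> int:
    """Storage to reserve on the target system for one workload environment."""
    replicas = sum(volume_sizes_gb)  # one replica per source volume, same size
    clones = replicas                # one clone per replica, used for recovery
    return replicas + clones


# Hypothetical workload environment with three volumes on the source system.
sizes = [500, 200, 100]
print(sum(sizes))                       # prints 800 (source system)
print(target_storage_needed_gb(sizes))  # prints 1600 (target system)
```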

Block storage replication

When you plan storage replication, you must answer the following key questions:

  • What is the distance between the two systems that will participate in replication?

    This value is important to determine the replication latency of the data.

    • If the distance is less than 300 km, you can use synchronous or asynchronous replication.
    • If the distance is more than 300 km, but less than 8000 km, you must use asynchronous replication.
    • If the distance is more than approximately 8000 km, you cannot use block storage replication.

    Only synchronous replication can guarantee zero data loss when a disaster occurs.

  • Should I make the volumes for my PureApplication Software workload environment replicate synchronously or asynchronously from the source system to the target system?

    As described in the first question about distance, if the distance is less than 300 km, you can use synchronous or asynchronous replication. For data-sensitive block or block VMFS volumes, use synchronous replication because it allows zero data loss. For example, if a block storage volume is used for a database, replicate that block storage volume synchronously. However, consider a block VMFS volume that has only VMs and snapshots, where the applications can handle a small data loss. In this case, you can replicate these block VMFS volumes asynchronously to achieve lower storage latency and better performance for the applications.

  • Do the two systems that I plan to connect to each other both support IP-based block storage replication?

    IP-based block storage replication is supported on Bluemix Local System and W2500 PureApplication Systems.

  • What network VLAN and subnet will I use for block storage replication between the systems?

    You can use different VLANs and subnets for IP-based block storage replication on each system. However, network routing must exist between the two systems so that each system can communicate with the other one over any of the three IP addresses that are used as part of block storage replication. In the context of a PureApplication Software Disaster Recovery solution, consider making this network (VLAN and subnet) separate from the shared data VLANs and subnets that are used on either system. This configuration allows more flexibility when you try to do isolated network practice disaster recovery.

    For more information about how to plan block storage replication, see the "Administering block storage replication" topic in the IBM PureApplication Software on IBM Bluemix Local W3550 2.2.3 documentation in IBM Knowledge Center.
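
The distance guidance in the first question can be written as a small decision function. The following Python sketch is a planning aid only. The function name is hypothetical, the treatment of a distance of exactly 300 km is an assumption, and the 8000 km limit is approximate in the guidance above.

```python
SYNC_LIMIT_KM = 300    # below this distance, synchronous replication is an option
ASYNC_LIMIT_KM = 8000  # approximate upper limit for block storage replication


def replication_options(distance_km: float) -> list[str]:
    """Return the replication modes that the distance guidance allows."""
    if distance_km < 0:
        raise ValueError("distance must not be negative")
    if distance_km < SYNC_LIMIT_KM:
        return ["synchronous", "asynchronous"]
    if distance_km < ASYNC_LIMIT_KM:
        return ["asynchronous"]  # zero data loss cannot be guaranteed
    return []                    # block storage replication is not usable


print(replication_options(120))   # prints ['synchronous', 'asynchronous']
print(replication_options(2500))  # prints ['asynchronous']
print(replication_options(9000))  # prints []
```

If zero data loss is a requirement, only a result that includes synchronous replication is acceptable, which constrains where the second data center can be located.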

Compute and cloud groups

You do not need to dedicate compute nodes in advance on the target system for disaster recovery. Instead, you can use the compute nodes for any other use, such as traditional cloud groups or a different Virtual Manager cloud group.

Both compute nodes and Virtual Manager cloud groups must be available on the target system during the disaster recovery procedure. When not enough compute nodes are available on the target system to create a Virtual Manager cloud group that is needed to recover a workload environment, you must reassign compute nodes from other cloud group environments.

You must carefully plan the number of compute nodes that are dedicated to production environments on both systems. This way, recovery of the most critical production environments can always occur upon failure.

Configuring the systems for disaster recovery

After planning is complete, set up the systems to participate in a disaster recovery event. To prepare the target system to take over for the source system if the source system fails:

  1. Set up a shared data VLAN network between the source and target systems. The IP addresses that are used by the workloads and PureApplication Software cannot change when they run on the target Bluemix Local System environment. The data VLAN on the target system must be identical to the data VLAN on the source system so that the IP addresses match.
  2. Set up an MKS Console IP group that is configured for the MGMT VLAN. You must create the MKS Console IP Group on the target Bluemix Local System with one IP address for each compute node that is needed for VM console access. For more information, see "The MKS Console IP Group" section in the "Adding IP groups" topic in the IBM PureApplication Software on IBM Bluemix Local W3550 2.2.3 documentation.
  3. Create volumes that match the storage that was created for the PureApplication Software workload environment. For each storage volume that you created on the source Bluemix Local System to be managed by the PureApplication Software workload environment, create a storage volume of the same type and size on the target Bluemix Local System. For more information, see the "Adding Volumes" topic in the IBM PureApplication Software on IBM Bluemix Local W3550 2.2.3 documentation.
  4. Create a volume group on the target system to organize the storage for replication. To create one new volume group, follow the steps in the "Adding volume groups" topic in the IBM PureApplication Software on IBM Bluemix Local W3550 2.2.3 documentation. Specify (none) for the cloud group. Add each volume that you created in step 3 to this new volume group:
    1. Go to the Cloud Volume Groups page, and select the Volume Group that you just created.
    2. In the details view, click Add Volume, and select Add Existing.
    3. Click the check box for each volume that you created in step 3. Click Add Volumes.
  5. Set up volume replication by configuring the IP addresses. To enable block storage replication between the source and target systems:
    1. Configure the management and replication IP addresses for IP-based block disk replication on each system. Follow the steps in the "Configuring IP addresses for IP-based block disk replication" topic in the PureApplication System W1500 2.2.3 documentation.
    2. Create a block storage replication profile on both the source and target systems. Follow the steps in the "Managing block storage replication profiles" topic in the PureApplication System W1500 2.2.3 documentation.
    3. You can adjust the rate at which the initial background disk replication occurs (default is 1000 megabits per second) from the source system to the target system.
      1. Use the storage_copy_bandwidth and background_copy_rate parameters in the "Block storage replication profile REST API" topic in the PureApplication System W1500 2.2.3 documentation to control the rate at which updates are propagated to the target system. For optimum replication, keep the bandwidth parameter less than the actual available network bandwidth so that you do not congest the fabric. Setting the bandwidth too high can delay the foreground I/O.
      2. Monitor the rate of replication (IP Remote Copy) from the Storage Controller Monitoring Performance browser page. To locate the IP address and user name of the storage controller, select System -> System Settings, and click External Application Access Settings.
  6. Start replication, and accept the request on the receiving rack. You must add each volume to be replicated for the workload environment to the block storage replication profile. For more information, see the "Adding volume pairs" topic in the PureApplication System W1500 2.2.3 documentation.
  7. Wait for replication to complete. The source system can fail over to the target system after the initial replication of each volume is complete. The replication status changes from Pending to Available.
  8. Validate that replication is completed. On the Volume Groups page of the Bluemix Local System web console for the target system, verify that the Overall Replication State field indicates Available.
    Overall Replication State is Available
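
To get a rough idea of how long the initial replication in step 7 takes, you can divide the amount of data by the background copy rate. The following Python sketch is an idealized estimate, not a measurement. The function name is hypothetical, sizes are assumed to be in GiB, and the calculation ignores protocol overhead and competing foreground I/O, so treat the result as a lower bound.

```python
def initial_copy_seconds(total_gib: float, rate_mbps: float = 1000.0) -> float:
    """Estimate the initial background copy time.

    total_gib: combined size of the volumes to replicate, in GiB (assumption).
    rate_mbps: background copy rate in megabits per second. The default of
               1000 matches the default rate mentioned in step 5.
    """
    if rate_mbps <= 0:
        raise ValueError("rate must be positive")
    total_bits = total_gib * 2**30 * 8
    return total_bits / (rate_mbps * 1_000_000)


# Hypothetical environment with 1000 GiB of volumes at the default rate.
hours = initial_copy_seconds(1000) / 3600
print(f"{hours:.1f} hours")  # prints 2.4 hours
```

Halving the rate doubles the estimate, which is worth keeping in mind when you lower the bandwidth setting to avoid congesting the network.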

Perform a disaster recovery failover

The following sections describe the failover steps for each of the disaster recovery scenarios in this tutorial.

Planned disaster recovery failover

  1. On the PureApplication Software workload environment on the source system, stop all running instances. For more information about stopping the different types of deployed resources in PureApplication Software, see the "Managing Instances" topic in the IBM PureApplication Software for x86 V2.2.3 documentation.
  2. Store all the instances. Verify that all the VMs and instances are in a stored state.
  3. Shut down the PureApplication Software VM.
    1. Use Secure Shell (SSH) to connect to the PureApplication Software VM for that environment by using the admin_shell user ID and the password for the administrator on the console.
    2. Run the psm shutdown command to shut down all the PureApplication Software services and the PureApplication Software VM.
  4. Verify in vCenter that all the VMs that are deployed by PureApplication Software are Stored (not listed in the cluster), except the PureApplication Software Manager VM.
  5. Record the port groups that the network adapters are attached to for the PureApplication Software VM. To find this information, open the vCenter web console, and select the PureApplication Software VM. In the Getting Started view, click Edit virtual machine settings. The port group is the value of the Network adapter 1 field.
  6. In vCenter, remove the PureApplication Software Manager VM from inventory, and leave the contents of the VM on the disk.
  7. Go to the system console for the system from which the PureApplication Software VM was removed. Move all the storage volumes for that Virtual Manager cloud group out of the cloud group.
  8. Clone the volumes on the source system.
    1. To create one new volume group, follow the steps in the "Adding volume groups" topic in the IBM PureApplication Software on IBM Bluemix Local W3550 2.2.3 documentation. Specify (none) for the cloud group.
    2. Add all of the volumes that are associated with the PureApplication Software workload environment to this volume group.
    3. Clone the new volume group as a backup in case you need to reattempt the recovery. These clones are not used for the recovery attempt unless something breaks.

    Recovery on the initial target system uses the replicated volumes.

  9. Go to the Block Storage Replication page, and switch the replication direction:
    1. Click Failover on the current source system for all the volumes that you removed from the cloud group.
    2. In the Failover operation window, select the option "When the failover operation completes, the primary and backup volumes switch roles, and replication is then enabled in the reverse direction." The source volume becomes the target volume, and the target volume becomes the source volume. For more information, see the "Administering block storage replication" topic in the PureApplication System W1500 2.2.3 documentation.
      Primary and backup volumes switch roles

      The switch takes a few minutes before the state of each volume returns to Available.

To recover on the new source system, complete the steps in Recover the deployed instances after a disaster recovery failover.
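
Before you remove the PureApplication Software Manager VM from inventory and switch the replication direction, every other VM must be in the Stored state (steps 2 and 4). The following Python sketch shows one way to check a list of VM states that you recorded by hand. The function name and the VM names are hypothetical, and the sketch does not query vCenter or PureApplication Software.

```python
def vms_blocking_failover(vm_states: dict[str, str], manager_vm: str) -> list[str]:
    """Return the VMs, other than the manager VM, that are not yet Stored."""
    return sorted(
        name
        for name, state in vm_states.items()
        if name != manager_vm and state.lower() != "stored"
    )


# Hypothetical snapshot of VM states on the source system.
states = {
    "PureSoftwareManager": "Running",
    "web-1": "Stored",
    "db-1": "Stopped",
}
print(vms_blocking_failover(states, manager_vm="PureSoftwareManager"))  # prints ['db-1']
```

An empty list means that it is safe to continue with the remaining steps of the planned failover.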

Practice disaster recovery failover

As part of the configuration for disaster recovery, you should have defined a volume group that contains the volumes that are associated with the software workload environment. If you did not create the volume group as part of the disaster recovery configuration, create it now. Validate that all of the volumes that are associated with the software workload environment are assigned to this volume group.

You are now ready to do a practice failover:

  1. Clone the volume group on the target system to a new volume group. Recovery on the target system uses the cloned volumes.
  2. While you maintain block storage replication between the two systems, isolate the target PureApplication Software environment from the source PureApplication Software environment. In this step, you shut down the source software environment and workloads to establish the isolation instead of attempting to isolate the network between the systems. If you choose to develop a network isolation procedure to keep the production workloads up, you must not disrupt the block storage replication between the source and target systems.
    1. Stop all running instances on the source PureApplication Software. For more information about stopping the different types of deployed resources in PureApplication Software, see the "Managing Instances" topic in the PureApplication Software 2.2.3 documentation.
    2. Shut down the PureApplication Software VM:
      1. Use SSH to connect to the PureApplication Software VM for that environment by using the admin_shell user ID and the password for the administrator on the console.
      2. Run the psm shutdown command. This command shuts down all the PureApplication Software services and the software VM.
    3. Prevent access to the source Bluemix Local System virtual manager.
      When PureApplication Software starts on the target Bluemix Local System, the workload deployment engine attempts to start workloads that are no longer running. To prevent the deployment engine from interacting with the source environment, you must change the password for the virtual manager (VMware vCenter Server):
      1. On the source Bluemix Local System web console, select System -> System Settings.
      2. Expand External Application Access Settings, and click Show Details for the user that is configured for the PureApplication Software environment.
      3. At the top of the External Users window, click the Regenerate Passwords link.

When PureApplication Software is started again on the source environment, update the Virtual Center Access and the compute node access with the new user name and password.

To practice recovery on the target system, complete the steps in Recover the deployed instances after a disaster recovery failover.

Unplanned disaster recovery failover

  1. Validate that the source system is not accessible. The source system must not be running instances that can cause IP address conflicts with the recovery procedure. Also, make sure that, if this source system comes back online during recovery, it does not disrupt the production workload environment that is being recovered on the target system. To prevent such problems, isolate networking to the source system by configuration, firewall changes, or unplugging data network cables.
  2. Verify that you defined a volume group that contains the volumes that are associated with the software workload environment. You should have completed this step as part of the disaster recovery configuration. If you did not create the volume group, create it now. Validate that all of the volumes that are associated with the software workload environment are assigned to this volume group.
  3. Clone the volume group to a new volume group. Recovery on the target system uses the cloned volumes.

To recover on the target system, complete the steps in Recover the deployed instances after a disaster recovery failover.

Recover the deployed instances after a disaster recovery failover

For information about the steps in this task, see Part 1 and Part 2 of this series. Keep in mind that you might have already completed some of the following steps.

Important: For the planned failover scenario, the target system is the original target system before the direction of block storage replication was reversed.

Complete the following steps on the target system:

  1. Make sure that a Virtual Manager cloud group exists for each cloud group that you will recover. For more information, see the "Administering cloud groups" topic in the PureApplication System W1500 2.2.3 documentation.
  2. Verify that an External Application User is created that has permissions for each cloud group that is identified in step 1. See the "External Applications" topic in the PureApplication System W1500 2.2.3 documentation.
  3. Make sure that each Virtual Manager cloud group contains at least one compute node.

Associating volumes to a cloud group

Use care when you associate all of the volumes on the target system with the correct cloud group. If volumes A, B, and C are all part of the same cloud group on the source system, the equivalent volumes must each be part of the same cloud group on the target system. For the steps to work, you must ensure that the volumes are assigned correctly to the cloud groups to match the configuration on the source system.
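
The grouping rule above can be checked mechanically if you record which cloud group each volume belongs to on both systems. The following Python sketch is a minimal illustration under that assumption. The function name, the volume names, and the cloud group names are hypothetical.

```python
from collections import defaultdict

UNASSIGNED = "(unassigned)"


def inconsistent_groups(
    source_groups: dict[str, str],  # source volume -> cloud group on the source
    target_groups: dict[str, str],  # target volume -> cloud group on the target
    replica_of: dict[str, str],     # source volume -> matching target volume
) -> dict[str, list[str]]:
    """Find source cloud groups whose volumes are split up or unassigned on the target."""
    seen: dict[str, set[str]] = defaultdict(set)
    for volume, group in source_groups.items():
        target_volume = replica_of.get(volume, "")
        seen[group].add(target_groups.get(target_volume, UNASSIGNED))
    return {
        group: sorted(targets)
        for group, targets in seen.items()
        if len(targets) > 1 or UNASSIGNED in targets
    }


# Volumes A, B, and C share a cloud group on the source, but C was misassigned.
source_groups = {"A": "cg1", "B": "cg1", "C": "cg1"}
replica_of = {"A": "A-clone", "B": "B-clone", "C": "C-clone"}
target_groups = {"A-clone": "dr-cg1", "B-clone": "dr-cg1", "C-clone": "dr-cg2"}
print(inconsistent_groups(source_groups, target_groups, replica_of))
# prints {'cg1': ['dr-cg1', 'dr-cg2']}
```

An empty result means that every set of volumes that shared a cloud group on the source system also shares one on the target system.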

  1. On the Cloud Volumes page, assign each volume that is associated with the replicated target environment to the proper cloud group.
    Assigning volumes with the cloud group
  2. Log in to the vSphere Web Client by using the Virtual Manager user of the target system. (To access the credentials for vSphere Web Client, select System -> System Settings, and click External Application Access Settings.)
  3. From the Navigator, select Hosts and Clusters, and select a compute node from one of the clusters.
  4. Under the Manage tab, on the Storage tab, click Storage Devices. Make sure that the compute node is connected to all of the block storage volumes that are associated with the cloud group that the compute node is a member of.
  5. Repeat this procedure for each compute node that is used for recovery to validate that the compute node has access to all the storage that it needs. If any storage is missing, validate that the storage was properly assigned. If the assignment is correct but the storage is still missing, you might need to remove the volume from the cloud group and add it again.
    Compute node is connected to the block storage volumes

Locate and start the PureApplication Software VM

  1. In the vSphere Web Client, from the Navigator pane, select Hosts and Clusters. From the Cloud Group (Cluster) that contains the PureApplication Software VM datastore, select a compute node (Host).
  2. In the right pane, under the Related Objects tab, click the Datastores tab. Right-click the deployment datastore (the datastore that contains the PureApplication Software VM), and select Register VM.
    Registering the VM
  3. In the Select File window:
    1. Select the directory that matches the name of the PureApplication Software VM in the first column. In this example, the name is PureSoftwareManager. You will select the directory with the name of the software VM that you created.
    2. In the Contents column, select the PureSoftwareManager.vmx file, which is preselected for you.
    3. Click OK.
      Selecting the directory that matches the software VM
  4. In the Register Virtual Machine wizard, for Name and Location, expand the tree to find the folder with the same name as the Cloud Group. Select the folder as the inventory location. Click Next.
    Selecting the inventory location
  5. On the Host/Cluster page, select the cluster (cloud group). Click Next.
    Selecting the cluster
  6. On the Specify a Specific Host page, select a host (compute node) from the list of hosts. Click Next.
    Selecting the host
  7. Click Finish to register the virtual machine.
  8. In the vSphere Web Client, in the Navigator pane, select the virtual machine. In the right pane, under Getting Started, click the Edit virtual machine settings link.
    Edit virtual machine settings
  9. In the PureSoftwareManager – Edit Settings window, to the right of Network Adapter 1, click the blank drop-down list, and select the correct network VLAN that the interface is associated with. This VLAN should be the same as the one on the source system for that PureApplication Software VM. By convention, the name of the network is the same as its VLAN ID for easy identification.
    Selecting the network VLAN for the interface

    Tip: If you do not see your VLAN listed, click Show more networks from the drop-down list to see the full list of VLANs. In the Select Network window, click the correct network, and click OK.

    Selecting the network
  10. Back in the Edit Settings window, click OK.
  11. Power on the PureApplication Software virtual machine in the vSphere Web Client.
  12. On the Summary tab, click Answer Question in the yellow information box.
    Information box on Summary tab
  13. In the Answer Question window, select I Moved it, and click OK.
    Answer Question window
  14. After several minutes, access the web console of the PureApplication Software VM. Use the IP address that is specified on the Summary tab.
  15. Select System -> System Troubleshooting, and under System Management, verify that the status for Service code shows Online. If the services are not yet online, wait a few more minutes, and then check the status again.
    Service code status is online

Reconfigure the PureApplication Software

  1. Get the Datacenter name and the Distributed Switch name of the cluster:
    1. Log in to the vSphere Web Client by using the Virtual Manager user ID for the target system. (To access the credentials, select System -> System Settings, and click External Application Access Settings.)
    2. On the Summary tab for the cluster, note the Datacenter name.
    3. Under the Related Objects tab, on the Distributed Switches tab, note the distributed switch name.
  2. In the PureApplication Software web console, select System -> System Settings. Expand Virtual Center Access. Change the settings to match the vCenter credentials for the application in the External Application Access Settings section of the System Settings page for Bluemix Local System. Enter the Datacenter name and Distributed Virtual Switch name from the previous step.
    Changing the Virtual Center Access settings
  3. Test the connection, and then save the changes.
  4. To make it easier to identify which Bluemix Local System the Software VM is running on, expand Customize Name, and change the System field to indicate that the Software VM is running on the new system. For example, enter DR Env #1 -Recovery on R44.
    Customizing the system name
  5. Select Cloud -> Cloud Groups. On the Cloud Groups page, click the eye or discover icon.
    Eye or discover icon on the Cloud Groups page
  6. Select Patterns -> Virtual Machines. Verify that all the VMs (except the PureApplication Software VM) are in the Stored state. If some VMs are not in the Stored state, wait a few minutes for the state to change to Stored.
    Verifying that VMs are in the Stored state
  7. Select Patterns -> Pattern Instances. Stop all instances that do not have a status of Stopped or Stored. The following figure shows that some instances have a status of Launching. You must change each instance that is in the Launching state to a Stopped state.
    Instance status
  8. For each instance that is not in a Stopped state, select that instance. In the Confirm window, click Stop.
    Stopping an instance
  9. Validate that all of the instances are now stopped.
    Validating that instances are stopped
  10. Select Hardware -> Storage Resource.
  11. In the Storage Resource pane:
    1. On the Data Stores tab, verify that the data stores are all present and associated with the proper cloud group. Keep in mind that these data stores contain all the content that is created on the source system, but their names and cloud groups now match the resources that are defined on the target system.
      Verifying that the data stores are associated with the correct cloud group
    2. On the LUNs tab, manually remap the Block Storage. A discovered volume contains its LUN identifier in its name. Use the Cloud Volumes page of the target Bluemix Local System to identify the mapping between a volume name and the LUN identifier of the storage volume.

      Tip: You can use the size of a discovered storage volume to identify which storage volume on the source system it maps to. Whenever possible, allocate each volume on the source system with a unique size.

      For each storage LUN with a state of Unavailable, in the Actions column, click the green remap icon for the LUN.

      Remapping LUNs with an Unavailable state
  12. In the Reconnect LUNs window, which shows the possible LUNs to which the discovered LUN can be remapped, select the correct LUN, and click OK.
    Selecting the correct LUN
  13. Select Hardware -> Compute Resources (Nodes). Select a compute node that has the eye or discover icon next to the compute node, and click Update.
  14. In the Update location information window, enter the new compute node ESXi user name, password, location, and IP address. You can find this information on the web console of Bluemix Local System. (Select System -> System Settings, and click External Application Access Settings.)
    Updating the location information
  15. Update any additional compute nodes that were discovered.
  16. Delete the old compute nodes that existed on the source rack. Select each compute node that is in an Unavailable state, and click Delete.
    Deleting compute nodes
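
The size-based tip for remapping LUNs can be worked through on paper before you open the Reconnect LUNs window. The following Python sketch is only an illustration of that reasoning and is not part of the product: the function name, the volume names, and the sizes are hypothetical, and the two dictionaries are assumed to be filled in by hand from the source and target consoles. A volume is paired only when its size is unique on both sides; everything else is reported as ambiguous so that you resolve it manually by LUN identifier.

```python
from collections import Counter


def match_volumes_by_size(source_volumes, discovered_luns):
    """Pair source volumes with discovered LUNs when their sizes are unique.

    Both arguments are dicts of {name: size_in_gb}. All names and sizes are
    hypothetical values that were copied by hand from the two consoles.
    Returns (matches, ambiguous): a dict of {source_name: discovered_name}
    and a sorted list of source names that cannot be matched by size alone.
    """
    source_counts = Counter(source_volumes.values())
    discovered_counts = Counter(discovered_luns.values())
    # Only consulted for sizes that occur exactly once among discovered LUNs.
    lun_by_size = {size: name for name, size in discovered_luns.items()}

    matches = {}
    ambiguous = []
    for name, size in source_volumes.items():
        if source_counts[size] == 1 and discovered_counts[size] == 1:
            matches[name] = lun_by_size[size]
        else:
            ambiguous.append(name)
    return matches, sorted(ambiguous)


if __name__ == "__main__":
    source = {"db_data": 100, "db_logs": 50, "app_a": 20, "app_b": 20}
    discovered = {"lun_A": 50, "lun_B": 100, "lun_C": 20, "lun_D": 20}
    matched, unresolved = match_volumes_by_size(source, discovered)
    print("Matched by size:", matched)
    print("Resolve manually by LUN identifier:", unresolved)
```

In this example the two 20 GB volumes cannot be told apart by size, which is why the tip recommends allocating each source volume with a unique size.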

Verify that you reconfigured PureApplication Software after the failover.
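
One way to make this verification repeatable is to record the states that the console shows and check them against the conditions from the preceding steps. The sketch below is a minimal checklist helper under stated assumptions: the function name and the sample names are hypothetical, the state strings are the ones used in this tutorial (Stored, Stopped, Unavailable), and the input is typed in by hand rather than read from any product API.

```python
def reconfiguration_issues(vm_states, instance_states, lun_states, node_states):
    """Return a list of problems found; an empty list means all checks pass.

    Each argument is a dict of {name: state} copied by hand from the
    PureApplication Software web console. Leave the PureApplication
    Software VM itself out of vm_states, because it is expected to run.
    """
    issues = []
    for name, state in vm_states.items():
        if state != "Stored":
            issues.append(f"VM {name} is {state}, expected Stored")
    for name, state in instance_states.items():
        if state not in ("Stopped", "Stored"):
            issues.append(f"Instance {name} is {state}, expected Stopped or Stored")
    for name, state in lun_states.items():
        if state == "Unavailable":
            issues.append(f"LUN {name} is Unavailable and still needs to be remapped")
    for name, state in node_states.items():
        if state == "Unavailable":
            issues.append(f"Compute node {name} is Unavailable and should be deleted")
    return issues


if __name__ == "__main__":
    problems = reconfiguration_issues(
        vm_states={"web-vm-1": "Stored", "db-vm-1": "Running"},
        instance_states={"web-pattern": "Launching", "db-pattern": "Stopped"},
        lun_states={"lun_A": "Unavailable"},
        node_states={"node-1": "Available"},
    )
    for problem in problems:
        print(problem)
```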

Restart the pattern instances

Select Patterns -> Pattern Instances. Select and start each pattern instance that you want to recover. For information about starting instances, see the "Managing instances" topic in the PureApplication Software 2.2.3 documentation.


After you complete the failover recovery, validate that all started instances recovered correctly. If you encounter problems during the recovery process, you can reattempt the recovery. First, complete the steps in Clean up after a practice failover, and then go back and complete the steps in Practice disaster recovery failover.

Remove the PureApplication Software workload environments

Part 2 of this series explains how to remove PureApplication Software workload environments. When these environments exist in a disaster recovery enabled setup, you must consider the additional steps that are highlighted in this section.

Clean up after a practice failover

During a practice failover, the production workloads typically keep running while the practice failover is tested in a network-isolated environment. However, as described in Practice disaster recovery failover, you shut down the production environment temporarily so that you can test the practice environment without any network isolation. Network isolation disrupts the block storage replication between the source and target systems. Therefore, avoiding network isolation is necessary for the practice failover procedure.

After you verify the practice failover, remove the practice environment. To remove this temporary environment for recovery practice, follow the "Cleaning up production workload environment" instructions as explained in Part 2. All references to PureApplication Software in Part 2 refer to the practice environment, not the production environment. Make sure that the volumes that are deleted are the clones, not the replicating volumes that are still being synced with the production environment.

To restart your production PureApplication Software workload environment now that the practice environment is cleaned up:

  1. Log in to the vSphere Web Client for the source environment, and restart the VM that is used to host PureApplication Software.
  2. Wait for the PureApplication Software services to completely start. The External Application Access User passwords were changed to provide isolation from the production environment for the practice failover.
  3. Create a new Virtual Manager and compute node user. On the web console of the source Bluemix Local System, select System -> System Settings, and go to External Application Access Settings. Update the settings that you configured in PureApplication Software to use the new user names and passwords.
    1. On the System Settings page, reconfigure the Virtual Center Access settings to use the new Virtual Manager user name and password.
    2. On the Compute Resource page, update the settings of each compute node to use the new user name and password for that compute node.
  4. Restart the production workloads.
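
Waiting for the PureApplication Software services to start completely amounts to checking the status, pausing, and checking again until the status is Online or you give up. The following Python sketch shows that polling pattern in general terms only. The `check_status` callable is a placeholder that you would have to supply yourself (for example, a function that prompts you to read the Service code status from the console); this sketch does not call any PureApplication Software or Bluemix Local System API, and the timeout and interval defaults are arbitrary.

```python
import time


def wait_until_online(check_status, timeout_s=1800, interval_s=60,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll check_status() until it returns "Online" or the timeout expires.

    check_status is a caller-supplied placeholder that returns the current
    status as a string. Returns True if "Online" was seen, False on timeout.
    The sleep and clock parameters exist so the helper can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if check_status() == "Online":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated status sequence; a real check would come from the operator.
    simulated = iter(["Starting", "Starting", "Online"])
    online = wait_until_online(lambda: next(simulated), timeout_s=10, interval_s=0)
    print("Services online:", online)
```

Checking the status before checking the deadline means that the status is always read at least once, even with a timeout of zero.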

Clean up after an unplanned failover

For this task, the goal is to clean up the original production environment on the system that was recovered from an unplanned outage. Before you begin:

  • Ensure that the production environment is up and running on the backup system.
  • Make sure that the data network on the original system is disconnected from the core network to prevent duplicate IP addresses and host names.
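
To see why the data network must stay disconnected, it can help to compare the addresses of the VMs on the original system with those of the recovered VMs on the backup system. The sketch below is a hypothetical illustration: the function name, host names, and IP addresses are invented, and the inventories are assumed to be lists that you compile by hand. After a failover, the recovered VMs are copies of the originals, so you should expect most or all entries to conflict.

```python
def find_conflicts(original_vms, recovered_vms):
    """Return original VMs whose IP address or host name is also in use
    by a recovered VM.

    Each argument is a list of (hostname, ip_address) tuples compiled by
    hand. Host names are compared without regard to case.
    """
    recovered_ips = {ip for _, ip in recovered_vms}
    recovered_hosts = {host.lower() for host, _ in recovered_vms}

    conflicts = []
    for host, ip in original_vms:
        reasons = []
        if ip in recovered_ips:
            reasons.append("ip")
        if host.lower() in recovered_hosts:
            reasons.append("hostname")
        if reasons:
            conflicts.append((host, ip, reasons))
    return conflicts


if __name__ == "__main__":
    original = [("web01.example.com", "10.0.0.11"), ("old-test.example.com", "10.0.0.99")]
    recovered = [("WEB01.example.com", "10.0.0.11")]
    for host, ip, reasons in find_conflicts(original, recovered):
        print(f"{host} ({ip}) conflicts on: {', '.join(reasons)}")
```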

To clean up after an unplanned failover:

  1. Activate the system management link on the original production system. Leave the data link inactive. This configuration allows access to the systems management console and vSphere Web Client without introducing address conflicts on the network with the restored workload on the target system.
  2. After an unplanned failover, stop the replication between the Bluemix Local Systems. From the web console of the source Bluemix Local System where the production workload environment is no longer running, select System -> Block Storage Replication. For each volume pair in the replication profile, click Delete. Delete only the replicas for the volumes that are associated with the software workload environment that is being removed.
    Deleting volume replicas
  3. Log in to each Bluemix Local System web console. Select System -> Block Storage Replication. Select the Block Storage Replication Profile, and make sure that no volumes are listed on either system. If any volumes are listed, verify that they are not associated with the software workload environment that is being cleaned up.
    Block Storage Replication Profiles window
  4. Log in to the vSphere Web Client on the original rack, and power off all of the VMs in the cloud groups for the previous production environment. After you log in to the vSphere Web Client, in the Navigator, select Hosts and Clusters. Right-click each running VM, and click Power Off. Make sure that the PureApplication Software VM is powered off in addition to any other running VMs in the cloud groups that are associated with this PureApplication Software workload environment.
  5. In the vSphere Web Client on the original rack, right-click each VM, including the PureApplication Software VM, and select Remove VM from Inventory. Make sure that all VMs are unregistered from all of the cloud groups that are associated with the vSphere Web Client environment on the original system.
  6. Clean up the original volumes and replica volumes on both systems.
    Important: You must have used cloned volumes from the replicas for recovery.
    1. In the web console of Bluemix Local System, select Cloud -> Volumes.
    2. In the Volumes window, select the check box next to each volume that is being replicated.
    3. Click the trash can icon to delete all of the volumes that are selected.
      Cleaning up the original and replica volumes on both systems
    4. In the confirmation window, click Delete to acknowledge that multiple volumes will be deleted.

      Important: If the replicas were cloned and the clones are hosting the production workload environment, run this procedure on the backup Bluemix Local System that is now running the production workloads.

  7. Using the vSphere Web Client, verify that all the storage is removed for each of the compute nodes in the Virtual Manager cloud groups, except for the 5.2 GB ESXi boot LUNs.

    To find the storage devices, select the compute node, click the Manage tab, and then click Storage to see the list of storage devices.

    List of storage devices
  8. In the vSphere Web Client, select the cluster, click the Manage tab, and then click the Settings tab. Under Services, select vSphere HA.
    If you see the message "vSphere HA is Turned ON" at the top of the details view, click Edit. In the Edit Cluster Settings window, expand the Admission Control section. Make sure that the Admission control policy is set to Do not reserve failover capacity.
    Editing vSphere HA
  9. Restart each compute node in the Virtual Manager cloud groups on the original Bluemix Local System.
    1. In the Bluemix Local System web console, select Hardware -> Compute Nodes.
    2. Select each compute node that was part of a Virtual Manager cloud group that is managed by the cleaned up PureApplication Software VM, and click Power Off.
    3. After the compute nodes finish powering off, click Power On to restart each compute node.
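
The storage check in step 7 above (only the 5.2 GB boot LUNs should remain on each compute node) can be expressed as a simple filter over the device list that the vSphere Web Client shows. The following sketch assumes that you copy the device names and sizes by hand into a dictionary; the function name, node names, device names, and the size tolerance are all hypothetical and do not come from the product documentation.

```python
import math

BOOT_LUN_GB = 5.2  # Size of the ESXi boot LUN, as stated in this tutorial.


def leftover_devices(devices_by_node, tolerance_gb=0.05):
    """Return {node: [device names]} for devices that are not boot LUNs.

    devices_by_node maps a compute node name to a list of
    (device_name, size_in_gb) tuples copied by hand from the storage
    device list. A device counts as a boot LUN when its size is within
    tolerance_gb of BOOT_LUN_GB (the tolerance is an assumption).
    """
    leftovers = {}
    for node, devices in devices_by_node.items():
        extra = [
            name
            for name, size_gb in devices
            if not math.isclose(size_gb, BOOT_LUN_GB, abs_tol=tolerance_gb)
        ]
        if extra:
            leftovers[node] = extra
    return leftovers


if __name__ == "__main__":
    inventory = {
        "node-1": [("boot", 5.2), ("datastore-1", 500.0)],
        "node-2": [("boot", 5.2)],
    }
    print("Devices that still need cleanup:", leftover_devices(inventory))
```

An empty result suggests that the cleanup removed all workload storage from the listed nodes.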

Conclusion

Part 3 of this series examined how to create a disaster recovery solution by using PureApplication Software running on Bluemix Local System. It demonstrated how to plan for a disaster recovery, practice for a disaster recovery, and recover from an actual unexpected workload failure. In this series you learned how to use the advanced features of IBM Bluemix Local System, including VMware workload environments, PureApplication Software workload environments, and disaster recovery, to build enterprise-grade private cloud solutions.

Acknowledgements

The authors thank Kevin Cormier and Jessica Stevens for their assistance with reviewing this article.


publish-date=10302017