Achieving business continuity (that is, making sure that your critical business functions can be performed in the face of a disaster) is always a tricky business. Designing systems to be available in the face of failure is one of the most complex problems in software engineering. Even understanding what business continuity and disaster recovery mean is where things just begin to get complicated.
For example, what most users mean when they discuss disaster recovery is the ability to get a site up and running in a matter of minutes (or, at most, hours) from a last good state of an existing system in the case of a catastrophic failure where an entire data center is lost.
Because such definitions can be so loose, it’s useful to define some terms for discussing business continuity for IBM PureApplication System. Wikipedia, in its article on disaster recovery, refers to two key measurements that are important to understand:
- Recovery point objective (RPO) is the measure of the maximum time period in which data might be lost from an IT service due to a major incident.
- Recovery time objective (RTO) is the duration of time within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.
So, for example, a business might define its RPO to be 12 hours; this means that two backups need to be taken a day, at a minimum. Likewise, a business might define its RTO in hours or days for systems that are important, but not absolutely business critical, or in minutes for systems which are absolutely central to the functioning of a business. What's more, you have to consider that different aspects of your business might have dramatically different expectations in terms of RPO and RTO. Balancing those requirements against the technical complexity of enabling disaster recovery is primarily what makes the subject so challenging.
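The link between an RPO and a backup schedule is simple arithmetic. The following Python sketch computes the minimum number of evenly spaced backups per day that keeps the gap between backups within a given RPO. The function name is illustrative only, and the calculation assumes backups complete instantly, which real backups do not.

```python
import math


def min_backups_per_day(rpo_hours: float) -> int:
    """Fewest evenly spaced daily backups whose interval does not exceed the RPO."""
    if rpo_hours <= 0:
        # Periodic backups can never deliver a zero (or negative) RPO.
        raise ValueError("RPO must be a positive number of hours")
    return max(1, math.ceil(24 / rpo_hours))


print(min_backups_per_day(12))  # 2
print(min_backups_per_day(1))   # 24
```

In practice you would schedule backups somewhat more often than this minimum, because the time a backup takes to run also counts against the RPO.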
This article surveys the different ways in which business continuity can be maintained with IBM PureApplication System. The discussion includes how these various strategies meet different RPO needs, and how you can meet your RTO objectives using PureApplication System. The features of the system for backup and recovery will be discussed, along with what you will need to do in order to recover your workload data in the case of a system failure. We'll also examine the details of a major new feature for simplified disaster recovery in PureApplication System version 1.1 and talk about how you can put that into place in your environment. Finally, we'll look at some of the other aspects of business continuity, such as how to achieve business continuity even when you can't take advantage of all the new features in PureApplication System.
In order to be realistic about what could be achieved in version 1.1 of IBM PureApplication System, some limitations had to be set on the disaster recovery measurements that could be supported.
First, discussions of RPO had to be limited to about a second or just under a second, using techniques discussed later in this article. More aggressive RPO requirements call for synchronous replication, which can bring RPO down to very nearly zero; however, additional hardware and software outside PureApplication System is required to achieve so stringent a goal.
Second, in terms of what can be achieved at present, RTO times must be measured in (at best) minutes or (more realistically) hours. At a minimum in the RTO, you must include data sync time, system restart time, and network redirection time; all of these together result in a total time that is measured in multiple minutes, with actual times varying by specific application. Limitations also had to be put on how this disaster recovery approach can be supported. For example:
- In a few cases, there are differences in how disaster recovery and other business continuity approaches work for virtual applications and virtual system patterns. These differences will be pointed out where they exist.
- We can only discuss disaster recovery between two PureApplication Systems (what we call a homogeneous approach). We have been asked about alternative architectures for disaster recovery that would involve disaster recovery sites built on either a customer-managed VMware infrastructure or IBM PureFlex™ with IBM Workload Deployer (a heterogeneous approach), but due to the difficulties of managing these images and infrastructures, such environments will not be considered here.
- The PureApplication Systems must be identically configured. Identically configured systems are defined as those which have the same model number (size and hardware platform) and are configured with the same number of compute nodes. They must also be at the same fixpack levels.
Aspects of disaster recovery for PureApplication System
There are three different aspects that have to be understood and planned for in order to use a two-machine (homogeneous) solution for disaster recovery in PureApplication System (Figure 1).
Figure 1. Aspects of disaster recovery for PureApplication System
Three aspects of a system running on PureApplication System have to be moved from one machine to another in order to restore functionality to its previous configuration. They are:
- Management data: Definitions of the patterns and images used in building virtual systems or virtual applications.
- Program state: The state of the databases, messages on IBM WebSphere® MQ queues, configuration information in the WebSphere Application Server DMgr, transaction logs, and so on.
- Network configuration: Redirects traffic from one set of nodes to another.
Let’s examine the movement of the management data first. Management data includes machine-specific parameters related to the customer networking, such as cloud group configuration, VLAN ranges, IP address ranges, and so on. All of that configuration information is stored locally in the storage of the management nodes of PureApplication System. These management nodes also contain all of the information about the patterns that have been imported into the machine from IBM PureSystems™ Centre, patterns and images that have been created or imported by the users of this machine, and also information about what patterns have been deployed and where.
When a pattern is deployed into PureApplication System, that management information is combined with the general aspects of the pattern, adding in the machine-specific aspects. For example, when an IBM WebSphere Application Server DMgr image is deployed to a specific IP address, its files after deployment would contain the IP addresses of the other images in its cell that are federated into it as part of the deployment process. There are other instances of this machine-specific information that are also stored as part of the deployed image in the local SAN storage of PureApplication System. That means that if you want to reuse all of the information from the volume that is stored on the SAN of the primary PureApplication System on the secondary system, then the IP addresses on the secondary system have to exactly match those of the primary. For this you must use what is called an IP Takeover approach.
In a nutshell, this is why any purely storage-centric disaster recovery schemes require some specific aid from the PureApplication System firmware itself. If your primary and your secondary machines are set up differently — for example, with different management data, and specifically set up with different cloud groups mapped to different IP groups on different VLANs — then if you simply copied the contents of the volumes from one machine to another, the machine-specific configuration information in the volumes would be different (and disconnected) from the appropriate management data of the new machine. This point will strongly influence your options for data copying; you have to either be able to live with the fact the volumes contain configuration-specific data and handle that through IP takeover, or you have to come up with ways to mitigate this by either not copying the configuration specific data, or by changing it in the copy.
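One way to reason about the IP takeover requirement is as a comparison of the network-related management data on the two racks. The Python sketch below is hypothetical: the dictionary layout (a cloud group name mapped to a VLAN ID and an IP range) is an assumption made for illustration, and is not the format in which PureApplication System stores or exports its configuration.

```python
import ipaddress


def find_network_mismatches(primary: dict, secondary: dict) -> list:
    """Return a description of every cloud group whose network settings differ."""
    problems = []
    for group, cfg in sorted(primary.items()):
        peer_cfg = secondary.get(group)
        if peer_cfg is None:
            problems.append(f"{group}: not defined on secondary")
            continue
        if cfg["vlan"] != peer_cfg["vlan"]:
            problems.append(f"{group}: VLAN {cfg['vlan']} != {peer_cfg['vlan']}")
        # Compare parsed networks so that formatting differences do not matter.
        if ipaddress.ip_network(cfg["ip_range"]) != ipaddress.ip_network(peer_cfg["ip_range"]):
            problems.append(f"{group}: IP range {cfg['ip_range']} != {peer_cfg['ip_range']}")
    return problems


# Hypothetical example data for two racks.
primary = {
    "prod": {"vlan": 101, "ip_range": "10.10.1.0/24"},
    "test": {"vlan": 102, "ip_range": "10.10.2.0/24"},
}
secondary = {
    "prod": {"vlan": 101, "ip_range": "10.10.1.0/24"},
    "test": {"vlan": 202, "ip_range": "10.20.2.0/24"},
}

for problem in find_network_mismatches(primary, secondary):
    print(problem)
```

An empty result would suggest that copied volumes could be reused as-is under IP takeover; any mismatch points to machine-specific data that would have to be excluded from the copy or changed in it.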
At this point, you need to consider the quality of service attributes (RPO) around the recovery of the data from the source system.
Different clients and different applications within those clients’ systems have different levels of recovery needs. At one end are those applications that only require that systems be up and ready to take new requests, and at the other end are those that require a system that is in a “recent” state; for example, those that have a relatively long RPO. For instance, it might be enough for some retail clients that the e-commerce system be brought back to the latest set of catalog items and then enable new customer traffic to create new orders – assuming that the customer will not be too disturbed if in-process items in a shopping cart are lost in a disaster. In other businesses, like banking, the recovery options may be much more stringent and have much smaller RPOs – this might require that in-flight transactions be captured and reconstructed either automatically (through recovery of transaction logs) or manually through a reconciliation process after recovery.
Given that there is a spectrum of requirements, you will usually need several different levels of recovery of program state information. These different levels are often product-specific, so while we can describe different approaches for different requirements, there are few general statements that can be made about all products. As a result, we will only make statements about how we can address recovery for the products that are entitled to PureApplication System (for example, WebSphere Application Server, IBM DB2®, and so on) and WebSphere MQSeries®. All other products should be broadly compatible with at least one of the disaster recovery approaches discussed here, but there might be product-specific differences that cannot be addressed in this article.
Based on the client’s RPO and RTO requirements, you need to choose an appropriate mechanism for program state transfer. These mechanisms will be discussed later, but in all of them, there is some need to either restore the data, or validate that the program state data is in place and available so that the backup pattern can begin working from it. Finally, you will need to make the appropriate network changes to direct traffic onto the newly restored versions of the patterns on the target machine.
Approaches for program state transfer
In general, you have to look at four possibilities for the transfer of program state information between the source and target system. Arranged in order of decreasing RPO and (effectively) increasing cost, they are:
- Backup and restore
- Disk replication
- Host-based replication (such as file replication)
- Shared file-system
Each of these options works best within certain ranges of RPO. A very rough outline of the appropriate RPO timescales for each solution is shown in Figure 2.
Figure 2. RPO scales
The following sections take a closer look at each of these approaches.
Backup and restore
What you must do on a regular basis is export the critical patterns and image configurations from the source system. The process for importing and exporting patterns is nearly identical between Workload Deployer and PureApplication System, and is described in the blog post "Exporting and importing your application on IBM Workload Deployer" (see Resources). However, in general terms, you can't always assume that you can directly import the full PureApplication System configuration onto the target system.
There will at least need to be some filtering and mapping between the source and target systems to address differences such as cloud group IP configuration, user access control lists, and so on. This mapping can be done in a scripted (automated) way using the command line interface facilities described in the Information Center (information on how to do this can be provided by IBM Software Services for WebSphere as part of a services offering).
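The filtering and mapping step can be pictured as a small transformation applied to each exported definition before it is imported on the target. The sketch below is an illustration only: the dictionary fields (`cloud_group`, `acl`) and the function name are hypothetical and do not represent the PureApplication System export format or its command line interface.

```python
import copy


def remap_for_target(exported: dict, cloud_group_map: dict, allowed_users: set) -> dict:
    """Return a copy of an exported definition adjusted for the target system."""
    remapped = copy.deepcopy(exported)  # never modify the exported original
    # Mapping: replace the source cloud group with its target equivalent.
    remapped["cloud_group"] = cloud_group_map[exported["cloud_group"]]
    # Filtering: keep only the users that also exist on the target system.
    remapped["acl"] = [user for user in exported["acl"] if user in allowed_users]
    return remapped


# Hypothetical exported pattern definition from the source system.
exported = {"name": "OrderEntry", "cloud_group": "prod-east", "acl": ["alice", "bob"]}
result = remap_for_target(exported, {"prod-east": "prod-west"}, {"alice"})
print(result)
```

Keeping the mapping tables in version control alongside the exported patterns makes the import on the target repeatable rather than a manual exercise during a disaster.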
The most fine-grained approach for returning a system to a known state is through backup and restore. While the capability for external backup of a volume is not built directly into PureApplication System, most existing backup and restore solutions, such as IBM Tivoli® Storage Manager, can be used with the images running on PureApplication System. Tivoli Storage Manager and other backup solutions have the advantage of being simple to set up and administer – in the case of Tivoli Storage Manager, there is only the installation of a backup agent into each image that is to be backed up (which can be done through an image deployment script package or through extend-capture), and some minimal post-deployment configuration that has to be done to connect the agent to a Tivoli Storage Manager server and set up a backup schedule. In all cases, the Tivoli Storage Manager server would run separately from the PureApplication System machine, and the Tivoli Storage Manager server storage should be replicated offsite as well.
However, the downside of any backup and restore solution is in the relatively long RPO. While you can use a solution like Tivoli Storage Manager to capture the last changed state of an image (for example, a WebSphere Application Server image after the last configuration change, or a DB2 database after a nightly data load), it is not an appropriate solution for situations with more stringent RPO because of the time lag between backups (because backups are not zero-time operations). However, in those situations where this is all that is needed, it remains a cost-effective and simple solution.
Finally, even with a backup and restore solution, there are some limitations on what can be backed up. So, care needs to be taken in choosing exactly what files can be backed up and restored; this is generally true of all software products.
In terms of RTO, backup and restore is probably the least responsive solution. Backup software takes time to run (usually measured in minutes, if not hours, depending on the size of the data being backed up), but can usually occur while the target system is running and does not require an outage. Likewise, restore is not instantaneous, but also could take minutes or hours to run, again, depending on the size of the data being restored. Finally, you have to restart the system on top of the restored image data, which again takes time measured in minutes.
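To make the RTO discussion concrete, the following Python sketch adds up the components described above: the time to restore the data, the time to restart the system, and the time to redirect the network. The function name, parameters, and sample numbers are assumptions chosen for illustration; measure your own restore throughput and restart times before relying on any estimate.

```python
def estimate_restore_rto_minutes(data_gb: float, restore_mb_per_s: float,
                                 restart_minutes: float, redirect_minutes: float) -> float:
    """Rough RTO estimate for backup and restore, in minutes."""
    if restore_mb_per_s <= 0:
        raise ValueError("restore throughput must be positive")
    restore_seconds = (data_gb * 1024) / restore_mb_per_s
    return restore_seconds / 60 + restart_minutes + redirect_minutes


# Hypothetical workload: 500 GB restored at 100 MB/s, a 10-minute restart,
# and 5 minutes to redirect network traffic.
print(round(estimate_restore_rto_minutes(500, 100, 10, 5), 1))  # 100.3
```

Even with these fairly generous assumptions, the estimate is well over an hour, which is why backup and restore suits only workloads with relaxed RTO requirements.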
Disk replication
In the PureApplication System V1.1 firmware, a new feature has been added that enables simplified disaster recovery using full disk replication of the entire built-in storage array from one PureApplication System to another. This solution provides coverage for all application workloads and applies to all applications and patterns. It is implemented using a combination of IBM SAN Volume Controller hardware disk replication between PureApplication System racks and real-time cloning of the primary PureApplication System management infrastructure to the backup PureApplication System.
The solution has been developed to provide for both planned and unplanned failover, as well as fail-back after failover to the original rack after the cause of the original outage is addressed. The recovery characteristics of this solution are a very low RPO (measured in hundreds of milliseconds, with the distance between racks being a maximum of 8000 km) and a reasonable RTO. The maximum projected RTO with this solution is six hours, but in practice it could be shorter than that.
Before you begin setting up the disaster recovery environment, there are a few out of band manual tasks that are required to prepare the racks for setup. The first, of course, is to install the firmware version 1.1 on both the planned primary and secondary rack. As part of this installation on existing pre-V1.1 racks, there is a step that must be taken to reconfigure the V7000, which will require you to take a short production outage while this occurs. In new racks that ship with version 1.1 installed, this will not be necessary.
The second out of band step is to set up the two racks for inter-PureApplication System IP network and fiber channel connectivity. Be aware that a non-negotiable requirement of this approach is that the two machines in the two data centers must be set up such that they can both be connected to the same VLANs and share the same IP address pools. If your data center networking will not allow this, then this approach as a whole will not work in your environment.
Once the manual setup steps are complete, the disaster recovery capability is managed from a new disaster recovery page within the System menu. From this page, an administrator can define and enable disaster recovery, monitor data replication, and execute a failover operation between PureApplication Systems.
As shown in Figure 3, an administrator creates disaster recovery profiles on the primary and backup PureApplication Systems. To create a disaster recovery profile, the user ID should have full permissions for Disaster recovery administration.
When creating the disaster recovery profile, the administrator specifies the peer management location, which is the peer’s management IP address or DNS name. Also specified is the user ID and password for establishing the trust relationship. This user should have full permissions for Security administration.
Figure 3. Create a new disaster recovery profile
After creating the disaster recovery profile, notice that the state of the profile is Defined, which is indicated by the “/” next to the profile name (Figure 4). There are three sections on the disaster recovery profile: enable, monitor, and failover. The latter two are disabled because the profile has not been enabled.
Figure 4. Defined disaster recovery profile
The next step is to validate that the PureApplication System is ready to participate in a disaster recovery relationship. Figure 5 shows a disaster recovery profile which has been validated, and a message indicates that the validation succeeded. Notice that the state of the profile changes to “validated,” indicated by the green check mark.
Figure 5. Validated disaster recovery profile
Next, you enable the disaster recovery role. The backup PureApplication System must be enabled before the primary PureApplication System. If the backup is not enabled then the enable action on the primary fails.
On the profile page, the administrator selects the button labeled Enable, and is then prompted to choose the role, either Primary or Backup.
After enablement is complete for both PureApplication Systems, then initial copies begin for the disks and management data. This process can take a long time to complete depending on the network speed and size of the data. After the initial copies are complete, the changes are then replicated as they occur. Replication status can be monitored at any time on either of the racks.
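The ordering rules for profiles can be summarized as a small state model. The Python sketch below is a simplified illustration of the sequence described above (define, validate, enable the backup, then enable the primary); the class, method, and state names are hypothetical and are not part of any PureApplication System API.

```python
class DRProfile:
    """Toy model of a disaster recovery profile on one rack."""

    def __init__(self, name: str):
        self.name = name
        self.state = "defined"   # state immediately after the profile is created
        self.role = None         # "primary", "backup", or None
        self.peer = None         # the profile on the other rack

    def validate(self) -> None:
        self.state = "validated"

    def enable(self, role: str) -> None:
        if self.state != "validated":
            raise RuntimeError(f"{self.name}: validate the profile before enabling it")
        if role == "primary" and (self.peer is None or self.peer.role != "backup"):
            # Mirrors the rule that enabling the primary fails until the backup is enabled.
            raise RuntimeError(f"{self.name}: enable the backup system first")
        self.role = role
        self.state = "enabled"


rack_a, rack_b = DRProfile("rack-a"), DRProfile("rack-b")
rack_a.peer, rack_b.peer = rack_b, rack_a
rack_a.validate()
rack_b.validate()
rack_b.enable("backup")    # the backup must be enabled first
rack_a.enable("primary")   # only now does enabling the primary succeed
print(rack_a.role, rack_b.role)
```

Reversing the last two `enable` calls raises an error in this model, which is the behavior you should expect from the enable action on the primary.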
Figure 6 shows the primary disaster recovery profile after it has been enabled. A message indicates that disaster recovery is enabled in the primary role. The role field also shows that it is in the primary role. Notice also that the Monitor and Failover sections are now available.
Figure 6. Enabling disaster recovery - primary role enabled
Replication status monitoring is provided on both the primary and backup and the information displayed is similar. A summary of the replication monitoring status is displayed on the disaster recovery profile panel shown in Figure 7. There are two types of replication monitoring: management data monitoring and storage monitoring. Detailed monitoring information for both is obtained by clicking on the View details link.
Figure 7. Monitoring disaster recovery replication status
Now, let’s discuss planned and unplanned failover.
An administrator initiates planned failover on the primary disaster recovery profile, then follows up with the actual failover on the backup rack. As shown in Figure 8, on the primary profile in the first step, the administrator can select View details and a reminder dialog of manual actions to perform is displayed. These steps, which stop disk updates on the primary, are to quiesce (inhibit) the compute nodes to ensure no new deployments take place, verify that the replication status shows that the initial copies are complete, stop the workloads on the system, and update the external routers to stop advertising the rack IP addresses.
Figure 8. Planned failover - prepare to failover on primary
As shown in Figure 9, the administrator performs a planned failover on the primary by selecting the button Start Failover followed by selecting OK. After completing the operation, the disaster recovery profile role shows “(none)” and the state returns to “validated.”
At this point, a failover on the backup PureApplication System must be performed before workloads can be started on the backup.
Figure 9. Planned - failover on primary
For both planned and unplanned scenarios, it is a two-step process to initiate the failover on the backup PureApplication System. In the backup DR profile (Figure 10), the administrator selects View details, then selects the type of failover, either Planned or Unplanned, in the dialog. A reminder of the manual actions to perform on the backup, including adding compute nodes to the cloud groups and configuring the top of rack switches, is displayed. Once these manual tasks are complete, the administrator selects OK. Adding a compute node to a cloud group will activate the cloud group.
Figure 10. Planned and unplanned - Prepare to failover on backup
As shown in Figure 11, when the administrator is ready to perform the failover on the backup DR profile, the administrator selects the button Start Failover then selects OK.
Figure 11. Planned and unplanned - failover on backup
After completing the operation, the profile role shows “(none)” and the state shows “validated.” At this point, the administrator can start the selected workloads from the workload console and update the routers to advertise rack IP addresses.
This approach has a number of advantages, but there are a few disadvantages as well. The primary advantage, as you've seen above, is that it is very easy to set up. It also will completely replicate all the workloads on your source system: there are no limitations on which patterns it will work with, or what custom workloads you develop that it can support. However, there are a few drawbacks. First and foremost, this solution requires that the target rack be used only for backup; you cannot run any workloads on that rack while it is being used as a backup rack. Second, this is a "hot-cold" solution. While it has an excellent RPO (as described above, measured in hundreds of milliseconds) and it can work over distances of thousands of kilometers, the workloads on the second rack must start up entirely upon takeover, so the RTO is limited by the time it takes to restart those pattern instances. This time will vary by how long it takes to restart your workload, but will probably be measured in hours; an RTO of six hours or less is guaranteed, but in many cases, that figure can be significantly shorter.
Host-based replication
Host-based replication (or more precisely, file or volume replication) solutions can help in those situations in which there is a more stringent RPO than backup and restore solutions, but where you cannot take advantage of the built-in disk replication solutions provided by version 1.1 of the PureApplication System firmware. In this general case, you put in place software that will capture or record all changes to the application’s files on a source server and then replicate those changes to a file system on a target server.
There are several potential open source host-based replication solutions that do this on one or more operating systems. One such example is Rsync, a GPL-licensed tool to remotely synchronize Linux® files and directories with one another. Another potential solution is DRBD (Distributed Replicated Block Device), which works at the block device level to replicate all changed blocks from a primary volume to a secondary volume. The drawbacks of these solutions are that they are only available for some of the operating systems supported on PureApplication System, and they would require substantial additional configuration to set up and manage. The bottom line is that while it is possible that you could build a solution around either approach, the result would be extremely limited in practice.
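As a rough illustration of the Rsync option, the Python sketch below builds (but does not run) an `rsync` command line that mirrors one directory to a target host. The host name and directory are hypothetical placeholders; `-a` (archive mode), `-z` (compress in transit), and `--delete` (remove files on the target that no longer exist on the source) are standard rsync options.

```python
import shlex


def build_rsync_command(source_dir: str, target_host: str, target_dir: str) -> list:
    """Build an rsync command that mirrors source_dir into target_dir on target_host."""
    # A trailing slash on the source tells rsync to copy the directory contents
    # rather than the directory itself.
    source = source_dir.rstrip("/") + "/"
    return ["rsync", "-az", "--delete", source, f"{target_host}:{target_dir}"]


# Hypothetical source directory and target host, for illustration only.
command = build_rsync_command("/var/mqm", "dr-target.example.com", "/var/mqm")
print(shlex.join(command))
# To execute it, pass the list to subprocess.run(command, check=True).
```

Run from a scheduler, a command like this gives an RPO no better than the interval between runs plus the transfer time, and copying files while the middleware is writing to them can produce an inconsistent copy, which is part of the additional configuration burden mentioned above.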
An alternative solution you can choose is Double-Take Availability from Vision Solutions (see Resources), an IBM PureApplication System partner. Double-Take Availability is supported on Red Hat Linux, AIX®, and Windows®. We will use Double-Take Availability in the detailed example below. Implementing disaster recovery with Double-Take Availability requires a three-step process:
- Update the prerequisites. Installing Double-Take Availability currently requires the presence of the Red Hat Enterprise Linux install image because additional packages that are not present in the base Linux OS image for PureApplication System need to be installed during the Double-Take Availability installation.
- Attach an installation script for Double-Take Availability that installs the Double-Take Availability software into the operating system of any image running any combination of middleware.
- Develop or reuse and attach middleware-specific scripts that tell Double-Take Availability exactly which files in which directories to synchronize and how (what frequency, and so on). Because, for example, the WebSphere MQ directory structure is different than the WebSphere Application Server directory structure, this needs to be middleware-specific.
When you follow this approach, you can construct virtual systems where one instance of the virtual system pattern is designated as the source and a second instance of the same pattern, running on another rack, is designated as the target. The middleware only runs on the source images. It is turned off on the target images until a failure occurs, and then the middleware on the target images is activated. While it is possible that this general approach might also work for virtual applications, this scenario is beyond the scope of this discussion.
File-replication solutions like Double-Take Availability can provide extremely good RPO, down to times measured in seconds. What's more, the RTO time can be quite good, since you are implementing a "hot-warm" solution instead of the "hot-cold" solution that disk replication requires. However, given that it is an asynchronous file-replication solution (as are all of the others considered above), it cannot meet a zero or near-zero RPO requirement. For that we must consider the shared file system approach.
Shared file system
Of course, the best solution for program state transfer would be one in which no loss of data would occur. The only solution for that is a shared file system approach, where both the target and the source images would read their data from the same file system. In effect, this describes a "hot-hot" solution. The problem is, however, that this shared file system needs to be redundant and clustered across the racks so that a failure of a single rack (either source or target rack) will not result in the failure of a system. Thus, this is not presently available in the current version of PureApplication System using only the capabilities of the two PureApplication systems we are considering.
In order to consider this, then, we have to rely upon external storage not provided by PureApplication System. PureApplication System can support several options for connectivity to external file systems. In particular, it supports NFS mounts to external file systems (which would work in the case of some products, such as WebSphere MQ).
WebSphere MQ presents an interesting problem to solve when addressing disaster recovery; the high availability and disaster recovery approaches taken by many of our current MQ clients often rely on solutions like HACMP, which are not applicable to PureApplication System. Instead, as they migrate onto PureApplication System, they should move to a relatively new feature introduced in WebSphere MQ 7.0.1, called MQ multi-instance support.
Figure 12. MQ multi-instance
What MQ multi-instance provides is the ability for two queue managers to run in an active and standby mode: if the “active” queue manager fails, then the standby can continue to service requests for the queues and topics defined on that queue manager. In order for this to work, the two queue managers need to have access to common shared storage. MQ supports NFS v4 for this common shared storage, which you can also use from within PureApplication System.
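The following Python sketch lists the WebSphere MQ control commands typically used to set up a multi-instance queue manager on shared storage. It only builds the command lines; it does not run them. The queue manager name `QM1` and the shared mount point `/MQHA` are hypothetical, `Prefix=/var/mqm` assumes the default data path on Linux, and the exact options should be checked against the WebSphere MQ documentation for your version.

```python
def multi_instance_setup_commands(qmgr: str, shared_root: str) -> dict:
    """Return the commands to run on each node, keyed by node role."""
    data_dir = f"{shared_root}/qmgrs"
    log_dir = f"{shared_root}/logs"
    return {
        "first_node": [
            # Create the queue manager with its data and logs on shared storage.
            ["crtmqm", "-md", data_dir, "-ld", log_dir, qmgr],
            # Start it, permitting a standby instance elsewhere.
            ["strmqm", "-x", qmgr],
        ],
        "second_node": [
            # Register the existing queue manager definition on the second node.
            ["addmqinf", "-s", "QueueManager",
             "-v", f"Name={qmgr}", "-v", f"Directory={qmgr}",
             "-v", "Prefix=/var/mqm", "-v", f"DataPath={data_dir}/{qmgr}"],
            # Start the standby instance, which waits to take over.
            ["strmqm", "-x", qmgr],
        ],
    }


commands = multi_instance_setup_commands("QM1", "/MQHA")
for node, node_commands in commands.items():
    for command in node_commands:
        print(node, ":", " ".join(command))
```

Both nodes must mount the same shared file system at the same path before these commands are run, and client applications need a connection definition that lists both hosts so they can reconnect after a failover.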
Consider the possible scenario in Figure 13.
Figure 13. MQ multi-instance recommended configuration
Here you have two queue managers, one running inside each IBM PureApplication System, connecting to an NFS server that you have set up inside a third image running as part of the first pattern. In this way, the two queue managers provide high availability for a portion of the pattern. However, even in this situation, you still have the possibility that the NFS server could fail; all you have really done is move the problem. In order to remove all single points of failure, you have to specify that an external NFS be used that is highly available.
Many users do not have a highly available NFS infrastructure. In that case, we would recommend using a highly available shared file system such as IBM General Parallel File System (GPFS), which is a synchronous replication solution that is also compatible with Red Hat Linux and WebSphere MQ, and should work in the scenario described above. That solution requires the GPFS file system be set up on a set of highly available, external hardware. (See Resources for information on setting up and configuring a set of highly available GPFS servers outside PureApplication System.)
This article reviewed several different ways in which you can meet your RTO and RPO objectives using features in IBM PureApplication System.
The authors thank Peter Van Sickel and Arunava Majumdar from IBM Software Services for WebSphere, Shaun Sellers from Vision Solutions, and Thomas Alcott from the Worldwide WebSphere Infrastructure team for contributing to, reviewing, and commenting on versions of this article.
- IBM PureApplication System Version 1.1 Information Center
- IBM Redbook: Implementing the IBM General Parallel File System (GPFS) in a Cross Platform Environment
- An introduction to MQ Multi-instance queue managers
- High availability with the Distributed Replicated Block Device
- Vision Solutions
- Video: Kyle Brown: IBM PureApplication System Innovations for Business Continuity
- IBM developerWorks WebSphere
Get products and technologies
- Blog: Exporting and importing your application on IBM Workload Deployer