Asynchronous replication of WebSphere Process Server and WebSphere Enterprise Service Bus for disaster recovery environments

This article describes an environment that is based on using a disk replication system in asynchronous mode. You can include this environment in a disaster recovery plan that includes a secondary data center using IBM WebSphere Process Server or WebSphere Enterprise Service Bus.


Charlie Redlin, WebSphere Process Server Architect, IBM

Charlie Redlin is an architect on the WebSphere Process Server development team in Rochester, Minnesota. He has worked in the development of WebSphere clusters and network deployment environments for many years. He currently works in a bring-up lab and is focused on the deployment and integration of WebSphere Process Server.



24 September 2008


Introduction

When a secondary data center is a part of your disaster recovery plan and IBM® WebSphere® Process Server or IBM WebSphere Enterprise Service Bus is part of your primary data center, you need a topology and ongoing maintenance activities that will allow that disaster data center to be useful. While there are multiple topologies that can satisfy those needs, this article describes a topology that utilizes a disk replication system.

RTO and RPO

Recovery Time Objective (RTO)
The time required to restore the environment after the primary site becomes unavailable. This is the maximum time your business can be unavailable.

Recovery Point Objective (RPO)
The restore point of your business state. A recovery point objective of 15 minutes means the snapshot used to restore the environment is no more than 15 minutes old. This is the maximum amount of data that your business can afford to lose.

With this disk replication system, snapshots of the state of the primary data center are captured and asynchronously transferred to the disaster data center. This type of topology allows you to have a short RTO as well as a relatively short RPO.

This article is intended for the architects and planners of the operations center. Before reading this article, you should have a good understanding of Process Server deployment environments and WebSphere Application Server Network Deployment environments. In addition, you should be familiar with the science (or is it an art?) of high availability, disaster recovery, continuous availability, and the other qualities of service associated with information technology operations.

Overview of the environment

The following drawing (see Figure 1) depicts a topology with a primary data center and a disaster data center. These data centers run in an Active/Standby mode where the disaster data center is idle until the primary data center is not available. Of course, the hardware and other computing resources of the disaster data center do not need to be idle while the primary data center is active. For example, they could be used for things that would be of a lower priority should the disaster data center ever be required, such as a test environment or a different set of applications.

The number of LPARs depicted in each data center is for illustrative purposes only. The actual number and purposes of the LPARs in your data center should be defined based on the needs of your application. What is important is the existence of a SAN, the division of the data within the SAN and the fact that the data in the SAN is replicated to a SAN in the disaster center.

LPAR

A logical partition of a computer. It can be thought of as an image of an operating system. When the computer has only one OS image active at a time, then the term LPAR can be replaced with the term computer.

In addition to the sharing of the data between the data centers, the configuration of the data centers needs to be similar enough that the data can be used in both data centers.

Figure 1: Replicated disks as the basis for a disaster data center

You will need to capture three types of data in the original data center and move that data to your disaster center:

  • The install data or the data that is associated with the WebSphere products.
  • The configuration data or the data that is associated with your applications and the resources needed to run them.
  • The run data or the data that is associated with specific instances of your processes or other business state.

Once this data is captured, it needs some modification in order for it to be useful in the disaster data center.

Basic operational flow

This section contains the procedures that are needed to manage the data that is associated with this environment. These procedures include those associated with the management of the primary data center and its ongoing data changes, and those associated with bringing up the disaster data center.

As you go about your normal management of the primary data center, you will perform actions that need to be reflected in the disaster data center. These operations and the associated actions needed to get the data to the disaster data center are shown in Table 1 below:

Table 1. Actions that result from normal management of the primary data center

Type of change made                 | Required procedure
Installation change                 | Take a snapshot of the installation data and transfer it to the disaster data center.
Profile change                      | Take a snapshot of the configuration data and a snapshot of the installation data and transfer them to the disaster data center.
Install or uninstall an application | Take a snapshot of the run data prior to saving the changes. After saving the changes take a snapshot of the run data and a snapshot of the configuration data.
Any other configuration changes     | Take a snapshot of the configuration data and transfer it to the disaster data center.

Continuously take snapshots of the run data and transfer them to the disaster data center.

Use these steps to bring up the disaster data center:

  1. Determine the snapshot to use for each data volume.
  2. Make that data available on the disks.
  3. Make the disk image available to the LPARs of the disaster data center.
  4. Change the host names or IP addresses (if required).
  5. Start the database and WebSphere administration processes.
  6. Start the WebSphere clusters to recover all in-flight work (first messaging, then support, and then application deployment).
  7. Open up the data center for new work.

Disaster recovery configuration considerations

In addition to the normal configuration of your environment, you will need to make some configuration decisions that are specific to the disaster recovery requirements. These decisions mostly involve disk management, but they also include some WebSphere configuration. These topics are discussed in this section.

Disk management

You should look at the data in your environment as consisting of three related but independent consistency groups:

  • The install data for each of the LPARs in the cell is included in the same consistency group.
  • The configuration data for each LPAR in the cell is included in the same consistency group.
  • The run data for each LPAR in the environment is included in the same consistency group.

The rationale behind dividing the data up into these independent consistency groups is that the actions that make the data inconsistent are different for each group. If you did not divide up the consistency groups, you would not be able to use your run data in your disaster data center for any period that your install or configuration data is inconsistent.

The rationale behind including the data for all of the LPARs in a single consistency group is:

  • It is required for the run data. You have no choice.
  • You are able to limit the number of things to be managed by combining them. This becomes more important as the number of nodes increases.

Installation data

The installation data of your system includes the binaries of the WebSphere products.

You will want your disaster data center to have the same version of the WebSphere products along with the specific fix packs and other fixes as the original data center so that your applications run consistently in the disaster data center.

This installation data consists of the install root, the maintenance directory of the Update Installer, and the install registry files. The three install registry files associated with WebSphere Process Server are: a .nifregistry file (introduced in version 6.1), the vpd.properties file, and the native or OS install registry. The .nifregistry file is found at /opt/.ibm/.nif/.nifregistry on a Linux system and in similar locations on other platforms. The vpd.properties file is found in the root directory on Linux and in similar locations on other platforms. The native install registry is managed by the individual operating systems.
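
On a Linux system, for example, the registry artifacts that sit outside the install root can be checked with commands like the following; the exact paths differ by platform and for non-root installations:

    # Linux locations of the install registry artifacts described above
    ls -l /opt/.ibm/.nif/.nifregistry    # NIF registry (WebSphere version 6.1 and later)
    ls -l /root/vpd.properties           # vpd (ISMP) registry; location varies by platform and installing user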

You include this data on a disk replication system with the following configuration:

Configuration in the original data center

  • Create the directories needed for mounting the SAN, such as:
    /opt/ibm/WebSphere/install and /opt/.ibm/.
  • Mount the SAN drive.
  • Create a directory on the SAN for the .nifregistry file and put a symbolic link in place to get there
    (/opt/ibm/WebSphere/install/.nif and ln -s /opt/ibm/WebSphere/install/.nif /opt/.ibm/.nif).
  • Install WebSphere Process Server or WebSphere Enterprise Service Bus to a subdirectory of the mount point (for example, /opt/ibm/WebSphere/install/ProcServer).
  • Install the Update Installer to a subdirectory of the same mount point (for example, /opt/ibm/WebSphere/install/UpdateInstaller). (A shell sketch of these steps follows this list.)
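
As a rough illustration, and assuming a Linux LPAR where the replicated install volume appears as the hypothetical block device /dev/sdb1, the setup described in the list above might look like this:

    # Create the mount points and mount the replicated install volume (device name is hypothetical)
    mkdir -p /opt/ibm/WebSphere/install /opt/.ibm
    mount /dev/sdb1 /opt/ibm/WebSphere/install

    # Keep the NIF registry on the replicated volume and link it into its expected location
    mkdir -p /opt/ibm/WebSphere/install/.nif
    ln -s /opt/ibm/WebSphere/install/.nif /opt/.ibm/.nif

    # Install WebSphere Process Server (or WebSphere Enterprise Service Bus) and the Update Installer
    # under the mount point, for example into:
    #   /opt/ibm/WebSphere/install/ProcServer
    #   /opt/ibm/WebSphere/install/UpdateInstaller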

Configuration in the disaster data center

  • Create similar directories, mount points, and the symbolic link.
  • Load the disaster site recovery scripts or procedures for the install replication volume.

Development

  • Scripts or procedures to mount the drives on the disaster site when needed.

One example of this configuration is shown in the following drawing (Figure 2). The drawing shows the directory structure that maps to the disk system. However, the location of the .nifregistry file may be different on different platforms and when WebSphere is installed as non-root. The drawing also shows that the WebSphere installations for all of the OS images of the data center are included in the same replication volume.

You should not take a snapshot of this replication volume according to a schedule; rather, you should cause a snapshot to be taken each time one of the install images changes. Any snapshot that is taken while an install is in progress will capture an unusable view of the install image, and when you try to use such a snapshot in your disaster center, you will get unpredictable results. To prevent these troublesome snapshots, take snapshots of the install volume using an "on-demand" model: take a snapshot every time the installation data is altered.

The types of actions that cause a change to the installation data include:

  • Installing a new instance of WebSphere
  • Applying a fix pack to the WebSphere instance
  • Applying an interim fix (iFix) to the WebSphere instance
  • Installing the Update Installer
  • Updating the Update Installer

With all of the install images on a single replication volume, you are able to exactly duplicate the original data center configuration in your disaster data center. If your disaster occurs while you are rolling out an installation change to your environment, then your environment in the disaster data center will allow you to continue rolling out the installation change at the pace you define.

Figure 2. Directory structure of the installation data

The install registries are useful for, at a minimum, enabling the uninstall mechanisms. If you choose to manage your disaster data center with mechanisms that rely upon these install registries, you will want to install WebSphere on the LPARs of the disaster center. After installing, you can replace the installation data according to the replication model presented earlier in this section.

Configuration data

The configuration data of your system describes the WebSphere environment.

You will want your disaster data center to have the same configuration as your primary data center so that any recovery can be complete.

Your configuration data includes the profile root and a few other files that are found in subdirectories of the <install root>. These other files include properties/profileRegistry.xml, properties/fsdb/*, and properties/Profiles.menu. There are also files in the logs directory that contain any errors related to profile actions and that might be useful in the disaster data center.

You include this data on a disk replication system with the following:

Configuration in the original data center

  • Create the directory for mounting the SAN, such as /opt/ibm/WebSphere/profiles.
  • Mount the SAN drive.
  • Create the profiles in a subdirectory of that mounted directory (/opt/ibm/WebSphere/profiles). (See the sketch that follows this list.)
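
As an illustrative sketch, assuming the same Linux layout as before and a hypothetical SAN device name: the manageprofiles command and its -create, -profileName, -profilePath, and -templatePath options are standard, but the exact template depends on your product and profile type, so treat the template path below as a placeholder.

    # Mount the replicated configuration volume (device name is hypothetical)
    mkdir -p /opt/ibm/WebSphere/profiles
    mount /dev/sdc1 /opt/ibm/WebSphere/profiles

    # Create a profile whose profile root lives on the replicated volume
    # (the template path is a placeholder; WebSphere Process Server ships its own profile templates)
    /opt/ibm/WebSphere/install/ProcServer/bin/manageprofiles.sh \
        -create \
        -profileName Custom01 \
        -profilePath /opt/ibm/WebSphere/profiles/Custom01 \
        -templatePath /opt/ibm/WebSphere/install/ProcServer/profileTemplates/managed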

Configuration in the disaster data center

  • Create a similar directory for mounting.
  • Load in the disaster site recovery scripts or procedures for the configuration replication volume.

Development

  • Scripts or procedures for mounting the disk.
  • Scripts to deal with host name changes.
  • Scripts to start up the admin processes of your disaster center.
  • Scripts to start up your disaster center resources.

Procedures

  • You need to consider profile changes as also consisting of installation changes. This means that the actions of creating a profile, deleting a profile, adding a node, or removing a node should also trigger a snapshot of the installation data. This snapshot of the installation data is needed for these configuration changes because some of the files that are altered by these changes are contained in the installation data. (Examples of these files were listed earlier and are found in subdirectories of the install root.)

One example of this configuration is shown in the following diagram (Figure 3). The drawing shows the directory structure that maps to the disk system. The files that sit outside of the profile root are not shown here because they are covered by the snapshot of the installation volume that you take with profile action changes.

Figure 3. Directory structure of the configuration data

You should not take a snapshot of this replication volume according to a schedule; rather, you should cause a snapshot to be taken each time one of the configuration images changes. A configuration image changes when configuration changes are saved and when the configuration changes are replicated to a node. Any snapshot that is taken while configuration changes are "in flight" will capture an unusable view of the configuration image, and when you try to use such a snapshot in your disaster center, you will get unpredictable results. To prevent these troublesome snapshots, take snapshots of the configuration volume using an "on-demand" model: take a snapshot every time the configuration is altered.
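
If you drive configuration changes through wsadmin scripting, one way to honor the on-demand model is to capture the snapshot immediately after the save and node synchronization have completed. A minimal sketch, assuming a Jython script that you would write, an illustrative deployment manager profile path, and a hypothetical take_snapshot.sh wrapper around your SAN vendor's CLI:

    # Apply the configuration change, save it, and synchronize the nodes
    # (applyConfigChange.py is a script you would write; it would end with AdminConfig.save()
    # followed by synchronization of the affected nodes)
    /opt/ibm/WebSphere/profiles/Dmgr01/bin/wsadmin.sh -lang jython -f applyConfigChange.py

    # Only after the save and synchronization have completed, capture and transfer the configuration volume
    /opt/admin/bin/take_snapshot.sh config_volume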

With all of the configuration images on a single replication volume, you are able to exactly duplicate the original data center configuration in your disaster data center. If your disaster occurs while you are rolling out a configuration change to your environment, then when you restart your environment in the disaster data center you can continue rolling out the configuration change and are not forced to roll the changes forward faster than you expected.

Run data (Ongoing replication)

The run data of your system is the information that is involved with the applications. This includes the transactions and the business process state.

You will want all of the components of your run data to be consistent so that your disaster site will have consistent data. Because your run data is changing continually, it is not reasonable to expect that the disaster data center will have the same state as your primary data center, unless you are using synchronous replication. In many environments, synchronous replication is not a valid option because of the performance impacts of a synchronous implementation.

The run data consists of the WebSphere transaction logs, the files that are associated with the Process Server databases, and the files that are associated with any other resource managers. The files of interest are those that reflect the current state of the database tables, the current state of the transactions, and any other data that is managed by the resource and reflects its current state. These files will differ depending upon the database product or resource manager and vendor that you have chosen. The database tables that are included in this run data include at least all of the tables that are associated with your Process Server configuration (persistent stores for messaging engines, Business Processes, Human Tasks, Failed Events, Relationships, and so forth).

You include this data on a disk replication system with the following configuration:

Configuration in the original data center

  • Create the directories needed for mounting the SAN, such as /opt/ibm/WebSphere/tranlogs on the WebSphere machines and /opt/ibm/WebSphere/database on the database machine.
  • Mount the SAN drive.
  • Configure the WebSphere transaction service to use this mount for its transaction logs (see the sketch that follows this list).
  • Configure the database to use this mount for its database and logs.
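
A sketch of the original data center side, assuming a hypothetical device name; the comments name the usual place to set the transaction log directory, but verify it against your release:

    # On each WebSphere LPAR: mount the replicated run data volume (device name is hypothetical)
    mkdir -p /opt/ibm/WebSphere/tranlogs
    mount /dev/sdd1 /opt/ibm/WebSphere/tranlogs

    # Then point each server's transaction service at a subdirectory of this mount, either in the
    # administrative console (the server's Transaction service settings, Transaction log directory field)
    # or with a wsadmin script that modifies the transactionLogDirectory attribute of the server's
    # TransactionService object and saves the change.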

Configuration in the disaster data center

  • Create a similar directory for mounting.
  • Load the disaster site recovery scripts or procedures for the run data replicated volume.
  • Install and configure the database catalog to find the appropriate files.

One example of this configuration is shown in Figure 4. This drawing shows the directory structure that maps to the disk system. All of the run data needs to be included in the same snapshot, and that snapshot needs to be taken at a single instant in time. Your performance needs may require you to place the database log files on different disk arms than the database data, or indicate some other placement needs. You will need to work with your database vendor, your SAN vendor, and your operating system to determine the optimum configuration for your requirements. As you work with your SAN vendor, make sure that the write order is preserved in the snapshot and its replica. Consistency of the run data only exists when the snapshot of the data is write-order consistent.

Figure 4. Directory structure of the runtime data

You will want to set a schedule for the snapshots that are taken of this volume. The schedule determines whether you can meet your recovery point objective (RPO). For example, if you have an RPO of 30 minutes, you will need to capture a snapshot at an interval just short of 30 minutes, because you also need to consider the time it takes to actually take a snapshot and transfer it to the disaster data center. Your SAN provider can help you sort out those details.
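
For example, with an RPO of 30 minutes you might drive the run data snapshot from cron, leaving headroom for the capture and transfer time. The take_snapshot.sh command is again a hypothetical wrapper around your SAN vendor's CLI:

    # /etc/cron.d/rundata-snapshot: fires at minutes 0, 25, and 50 of every hour,
    # keeping the interval under the 30 minute RPO used in this example
    # (take_snapshot.sh is a hypothetical wrapper around the SAN vendor's snapshot and transfer CLI)
    */25 * * * *  root  /opt/admin/bin/take_snapshot.sh rundata_volume >> /var/log/rundata-snapshot.log 2>&1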

WebSphere configuration

Security

You will create and write files on one system and then read those same files on a different system. Because of OS-level security, you will need to make sure that the users and groups for the systems are consistent. Operating systems actually manage the ownership and permissions for file access with an ID rather than with a name, so you need both systems to recognize the same ID.

When you use LDAP to manage your security, you will need to ensure that the ID provided by the LDAP server of the original data center is the same as the ID provided by the LDAP server of the disaster center. When you use OS security, you need to ensure that both systems convert the same names to the same IDs. You can enforce this consistency by specifying the IDs explicitly when you manage the identities on the operating systems. (For example, if you are using a UNIX operating system, use the -u option of the adduser command.)
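
For example, on Linux you can pin the numeric IDs when you create the administrative group and user, and then run exactly the same commands in both data centers (the names and numbers here are only examples):

    # Use the same explicit numeric IDs in both data centers (names and IDs are examples)
    groupadd -g 501 wasgrp
    useradd -u 501 -g wasgrp -m wasadmin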

Host Names

Because the recovery data center is physically different from the original data center you may want the OS images in the recovery data center to have different host names than those used in the original data center. There are two types of host names that you will need to consider: the host names where WebSphere is installed and the host names of the database.

The host names of the LPARs that host the WebSphere profiles are contained in a set of configuration files named serverindex.xml. There is one of these files in every profile for every node in the cell. For example, in a cell with 15 nodes, where one of those nodes is named Node1, there will be 16 copies of the serverindex.xml file for Node1: all 15 nodes and the dmgr have a config directory, each of those config directories has a directory for Node1, and that directory contains a file named serverindex.xml. The host names that are included in that file need to be changed from the names used in the original data center to the names used in the disaster data center. If you are careful with the naming, then you can create a script that modifies the names automatically and runs at the same time as the script that copies the profileRegistry.xml file. An example of this naming scheme would be to name a host DC1xxxxx in the original data center and DC2xxxxx in the disaster data center. You can also play games with short names, or even use the same names and just change the domain names. However you resolve the names, they need to be contained in the serverindex.xml files.
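
If you follow a convention such as DC1xxxxx in the original data center and DC2xxxxx in the disaster data center, the substitution can be scripted as a simple text replacement over all of the serverindex.xml files under the replicated profile root. A rough sketch for Linux (back up the files first, and adapt the pattern to your own naming convention):

    # Rewrite original data center host names to disaster data center host names
    # in every serverindex.xml under the replicated profile root (naming convention assumed)
    find /opt/ibm/WebSphere/profiles -name serverindex.xml -print0 | \
        xargs -0 sed -i 's/DC1/DC2/g'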

The host name of the LPAR that hosts the database is configured in WebSphere and is also contained as a binary entry in the transaction logs. As a result, there is no configuration value that you can change to change this host name. Instead, you need to either use the same IP address in both data centers for the database host or use some other mechanism that routes work to multiple destinations. Some databases have a feature that allows you to configure multiple IP addresses for the same database; when the primary destination is not available, the database directs the communication to the disaster data center (for example, DB2 HADR or Oracle RAC). Do not confuse this with a database's asynchronous replication features: asynchronous database replication by itself is insufficient and should not be employed here.

File synchronization

Because you don't have control of when the disaster will occur, you may need control of how the nodes and the dmgr are synchronized in the disaster center. Because of this requirement, you will need to turn off resynchronization of the nodes on startup. You can turn off automatic resynchronization on node agent startup in the configuration of the file synchronization service of the node agent.
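
You can make this change ahead of time in the administrative console (the node agent's File synchronization service settings, where you clear Automatic synchronization), or you can script it. A minimal sketch, assuming a wsadmin Jython script that you would write and an illustrative deployment manager profile path; verify the object and attribute names (ConfigSynchronizationService, autoSynchEnabled) against your release:

    # Disable automatic synchronization for a node agent
    # (disableAutoSync.py is a script you would write; for example, it could locate the node agent's
    # ConfigSynchronizationService object, set its autoSynchEnabled attribute to false, and save the change)
    /opt/ibm/WebSphere/profiles/Dmgr01/bin/wsadmin.sh -lang jython -f disableAutoSync.py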

Ongoing maintenance

The procedures that are required to manage this environment on an ongoing basis include the management of the data and then starting up the disaster data center to verify that your environment is complete. The procedures to manage the data are a part of your management of the snapshots. The disaster data center should be started as a part of an ongoing test of your disaster plan.

Managing the snapshots

You need two models to manage your snapshots:

  • an on-demand model
  • a regularly scheduled model

You will want to capture a snapshot of your install when you make install changes. An install change consists of applying a fix or a fix pack, or creating a new install image.

There is a period of time when your install image is not consistent, and any snapshot taken during that time will not be useful to you in your disaster data center. This time period starts when you begin your install change and ends when the install change is complete. The last step of an install change should be to capture a snapshot of the install volume. If you apply install changes with scripting, then you should also include a last call in that script to interact with your SAN to capture and copy a snapshot.
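
For example, a wrapper script around a fix pack installation could end with the snapshot call. The Update Installer invocation below is only indicative, and take_snapshot.sh stands in for whatever capture-and-transfer command your SAN vendor provides:

    # Apply a fix pack and, only if it succeeds, capture the install volume as the last step
    # (the Update Installer options are illustrative; take_snapshot.sh is a hypothetical SAN wrapper)
    /opt/ibm/WebSphere/install/UpdateInstaller/update.sh -silent -options /tmp/fixpack.response.txt \
        && /opt/admin/bin/take_snapshot.sh install_volume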

If you roll out your installation changes, then you should take a snapshot as the change is rolled out to each instance. This behavior will enable you to bring up your disaster data center without requiring that you first complete the rollout of the install change.

Likewise, there is a period of time when your configuration image is not consistent, and any snapshot taken during that period will not be useful to you in your disaster center. The configuration image is inconsistent during the following periods:

  • dmgr: The time period begins for the dmgr when you indicate that a configuration change should be saved and ends when the save is complete.
  • Node: The time period begins for each node when you indicate that the configuration change should be synchronized with the dmgr and ends once the synchronization is complete.
  • Profile created, deleted, augmented, or unaugmented: The time period begins when a profile is created, deleted, augmented, or unaugmented, and ends when the profile command is completed.
  • Profile federated or node removed: The time period begins when a profile is federated or a node is removed from a cell and ends when the command is completed.

The last step of a configuration change should include a command to take a snapshot of the configuration data. In addition, if the change involves a profile command or a profile federation command, then a snapshot of the install image should also be taken.

The run data is the one replication volume whose snapshots are not taken using the "on-demand" model; it should be set up for periodically capturing a snapshot. It is this period that determines the recovery point objective (RPO) of your disaster data center. With your install and configuration changes, you can create build processes so that your disaster center will not be missing any configuration or installation data. However, it may be hard to achieve the same "no data loss" environment with your run data: it changes quickly, and because you have chosen asynchronous replication, your disaster data center will lose all data that was processed by the primary data center after the time of the last snapshot and before the failure.

In fact, you will likely want to keep a couple of snapshots' worth of data around, just in case the last snapshot was not usable. The data on a disk that is restored from a snapshot is only crash consistent; in other words, it looks like the data on a disk after an operating system crash. If this leaves some sectors or inodes invalid, you could end up losing valuable data. If that is the case, you may need to restore to a previous checkpointed time. Check with your SAN provider and operating system for details.

Testing the environment

The worst time to find out that there is something missing in your disaster recovery setup is when you need to start up the disaster data center because of an actual disaster. This can be avoided with some testing. One way to test the disaster process is to wall off the disaster center by not letting any outbound traffic find its way out of that data center, then stop the periodic copying and let the disks be used by the disaster center to recover. Once the center is walled off, test the environment, verify that the disaster center can recover using the replicated data, and verify that the disaster data center can process load. As you do this, pick a time when you can tolerate the increased risk of not copying the data. Once you have verified that recovery is complete, you can reconnect the copy and continue with your level of protection.
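
You would normally wall off the disaster center at its border routers or firewall. As a rough per-LPAR illustration only, on Linux you could temporarily drop any outbound traffic that is not destined for the disaster center's own subnet (the subnet shown is made up):

    # Temporarily drop outbound packets that would leave the disaster data center's own subnet
    # (10.20.0.0/16 is an illustrative subnet; remove the rule when the test is finished)
    iptables -A OUTPUT ! -d 10.20.0.0/16 -j DROP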

Starting up the disaster center

Starting up the data center requires a number of steps. Most of these steps can be prepared very early on and can be scripted for consistency, ease, and repeatability. The steps that you need to take to bring up the data center are:

  1. Attach the storage to the OS images in the disaster data center.
  2. Start up the WebSphere environment.
  3. Open up the data center for work.

Selecting the snapshot

When you start up your disaster data center, you will need to select the snapshots to use. The run data snapshot that you select will normally be the last snapshot that was transferred to the disaster data center. The one exception is if you were in the middle of installing or uninstalling an application at the time of the failure; in that case, you will need to select the run data snapshot that was taken prior to saving the changes.

The configuration data and installation data snapshots that you select will normally be the last snapshots that were transferred to the disaster data center. The exception is those situations where two snapshots are taken as the result of a single install or configuration action; in this case, you need to select the last pair of snapshots that are consistent with each other.

Process

The activities associated with attaching the storage to the OS are:

  1. Change the mode of the disks from replicating data from the original data center to being the source of the data for the operating system images.
  2. Restore the last good snapshot to the disks that will be used for the recovery data center: for example, the last snapshot of the install volume, the configuration volume, and the run data. Which snapshot to use depends upon the install and configuration actions that were or were not active at the time of the snapshot. Use the snapshots that are consistent with your run data.
  3. Make the disks available to the operating system image. This may require some procedures to check and fix any files that were corrupted because of the "crash" state.

The activities associated with starting up the WebSphere environment are as follows (a scripted sketch appears after the list):

  1. Start the database server.
  2. Start the dmgr and the node agents.
  3. Start the messaging clusters.
  4. Start the supporting clusters.
  5. Start the application deployment clusters.
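
A sketch of what this startup sequence could look like when scripted, assuming illustrative profile paths and a wsadmin Jython script (startClusters.py) that you would write to start the clusters in the required order:

    # 1. Start the database server (the command depends on your database product).

    # 2. Start the deployment manager, then the node agents (startNode.sh runs on each node's LPAR).
    /opt/ibm/WebSphere/profiles/Dmgr01/bin/startManager.sh
    /opt/ibm/WebSphere/profiles/Node1/bin/startNode.sh

    # 3-5. Start the clusters in order: messaging, then support, then application deployment,
    #      for example with a wsadmin Jython script that you would write.
    /opt/ibm/WebSphere/profiles/Dmgr01/bin/wsadmin.sh -lang jython -f startClusters.py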

To open the data center for work, set your routers to point to the new environment. Do not underestimate the effort required to build a reliable and repeatable connection to the replicated data. Also, make sure that you continually test this path. If anything goes wrong in this process and the replicated data is not available to the operating systems, then the impact of your disaster will continue to compound. This is one of the key paths that should be tested frequently and completely.

What to monitor in your disaster data center

  • After you start the messaging clusters, you can look at the messages in the messaging engines. These will show you any messages that were on the queues at the time that the snapshot was taken. You should not do anything with them, but it might be interesting for you to see the queue depth of some of the queues.
  • If you choose to look at the messages, you might see that some of them have a transaction associated with them. The messages with this transaction association are those that are in the process of being placed on the queue or removed from the queue and for which the transaction is in flight.
  • After you start the support cluster, you will notice that the messages that are associated with the business events will start to be processed and any transactions associated with them will be completed (unless the application clusters are required to complete the transaction).
  • Once you have started the application cluster, you will notice that the messages that are associated with the applications will be processed and the transactions associated with them will be completed.
  • When the entire disaster data center has completed the start up process, you should see all of the business processes in a consistent state ready for the next actions to take place.

Extending the principles presented in this article

The particulars of the environment, as laid out in this article, are based on a number of design choices. In this section, you will examine some possible alternative choices and the impact those choices have on the environment described thus far.

Less capacity in the disaster data center

If you have less capacity in your disaster data center than your primary data center, you will still need the disaster data center to have the same configuration as the primary data center. However, you can serialize your recovery to reduce the instantaneous required capacity. This serialization allows you to roll recovery through the cell and only require enough capacity to support less than the full topology.

Implementing this rolling recovery will require a development effort to orchestrate the recovery. For example, you don't need to have failover messaging enabled for recovery, so you could reduce the number of servers needed to support messaging to only those with active messaging engines. In addition, you could serialize the recovery of each application cluster member: allow the cluster member to recover any activity and then, when it has completed, stop that cluster member and move on to the next.

Another example in staging a recovery is when your environment includes WebSphere Business Monitor. You can recover all of your applications that generate the events prior to recovering the processing of those events.

However, if you decide to recover with less capacity, you will need to develop the orchestration of the recovery activities, and that will very likely require a good understanding of the applications.

Not using disk replication for install and configuration data

You can choose to use other methods to copy the data to the disaster data center. For example, you may want to look into the backupConfig and restoreConfig commands. You may also be interested in the customized installation package (CIP) technology. As you consider any of these technologies, you should always be aware of the condition that the data for the messaging engines and the data for the transaction managers can only be used by servers that share the exact same configuration.

For the install images, you can use a "dual maintenance" approach where you create two "identical" data centers by running all of the same commands. However, you cannot create the configuration image using that same approach: running the same commands in two different data centers will result in different UUIDs for the resources, and the data in the primary data center is only useful in the disaster data center if both data centers have the same UUIDs. As a result, running the same configuration commands in the disaster data center as you did in the primary data center will not give you the disaster recovery solution described in this article.

Using the same host names in the disaster center

When you use the same host names or the same IPs in both the original data center and in the disaster data center, you do not need to include the scripts that change the host name or host IPs nor do you need to configure the virtual names to deal with host name changes.

Using a database replication product

If you are using asynchronous replication, don't be tempted to use a database replication product for the replication of your run data. This is because the data that is managed by the database is insufficient. The database replication products do nothing to synchronize your WebSphere transaction logs and will leave you with a disaster data center with inconsistent state. A snapshot of the run data needs to include both the database data and the transaction logs. Anything else will give you an inconsistent state at the disaster recovery site.

Taking advantage of the install registries

One of the disadvantages of using a copy of the installation data, rather than managing the installation of the disaster data center by applying the same install actions on both data centers, is that you cannot uninstall the binaries on the disaster data center. This is because the uninstall code uses the install registries to recognize the installed components. If you would like to create your disaster center so that it can take advantage of these registries, then you will need to install WebSphere Process Server or WebSphere Enterprise Service Bus on the disaster center, rather than just rely upon the existence of the mounted file system.

When you do these installs, you should mount a volume for the install that will be replaced by the replicated copy of the install image. (The ISMP registry is found in a file at a location similar to /root/vpd.properties.)

Integration with the rest of your data center

This article has covered the WebSphere components of your data center. You may want to include the installation and configuration of other components of your data center (for example your database). However, how to do this and what should be included in each volume is not covered here. Of course, the exception to this integration is the run data of the database. The tables and logs of the database and the transaction logs of WebSphere do need to be included in the same consistency group.

Summary of disaster recovery configuration considerations

  • A disk subsystem that can capture a snapshot of your data in your primary data center and replicate it to your disaster data center. This subsystem may be attached either as a network attached storage (NAS) device or as a part of your storage area network (SAN).
  • Three replication volumes set up with guaranteed consistency for each of the volumes. One volume is set up for install, one for configuration and one for the runtime data. The runtime data includes the transaction logs of WebSphere and the logs and data of the database.
  • The snapshots for install and configuration are taken as needed and as the last step of any install or configuration change. The run data snapshot will be taken periodically and that period defines the recovery point of the disaster data center.
  • When you choose to have different host names in the disaster data center than in the original data center, some changes are required in the configuration files as they are moved to the disaster data center. The host name of the database that is configured to be used by WebSphere must also be carefully managed.

Basic operational flow (ongoing operations)

Perform the following ongoing operations:

  1. When you make an installation change, take a snapshot of the installation data and transfer it to the disaster data center.
  2. When you make a profile change, take a snapshot of the configuration data and a snapshot of the installation data and transfer them to the disaster data center.
  3. When you install or uninstall an application, take a snapshot of the run data prior to saving the changes. After saving the changes, take a snapshot of the run data and a snapshot of the configuration data.
  4. When you make any other configuration changes, take a snapshot of the configuration data, and transfer it to the disaster data center.
  5. Continuously take snapshots of the run data and transfer them to the disaster data center.

Start up the disaster data center:

  1. Determine the snapshot to use for each data volume.
  2. Make the data available on the disks.
  3. Make the disk image available to the LPARs of the disaster data center.
  4. Change the host names or IP addresses (if required).
  5. Start up the database and WebSphere administration processes.
  6. Start up the WebSphere clusters to recover all in-flight work (first messaging, then support, and then application deployment).
  7. Open up the data center for new work.

You must test your environment periodically to verify that the mechanisms you have for connecting the disks to the operating systems continue to provide a reliable connection.

Conclusion

This article described an environment that uses asynchronous replication of a disk replication system to implement a disaster recovery data center that includes WebSphere Process Server or WebSphere Enterprise Service Bus.


Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Business process management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Business process management, WebSphere
ArticleID=340451
ArticleTitle=Asynchronous replication of WebSphere Process Server and WebSphere Enterprise Service Bus for disaster recovery environments
publish-date=09242008