Faster disaster recovery in IBM Business Process Manager

Implementing a multiple data center approach to improve recovery time

This article describes an infrastructure topology for IBM® Business Process Manager that includes elements that reside in distinct data centers that may be geographically separated from each other. Such a topology can be useful in achieving disaster recovery objectives in certain circumstances, especially when recovery times faster than those offered by traditional approaches are required. Additionally, the strategy described in this article uses Oracle® Database File System (DBFS) to enable the database manager to control replication of the WebSphere® transaction and compensation logs, as well as traditional IBM BPM database content. This content is part of the IBM Business Process Management Journal.


Yu Zhang (zhangyzy@cn.ibm.com), Senior Software Engineer, IBM

Yu Zhang is a system verification test (SVT) architect for business process management (BPM) at the IBM China Software Development Lab. He has rich experience in test methodologies and a deep technical background in the J2EE and BPM areas. He is currently focused on BPM high availability and disaster recovery testing.



Chris Richardson (chrisri@us.ibm.com), Software Architect, IBM

Chris Richardson is an architect focused on scalability, resilience, and performance on the IBM Business Process Manager development team. Chris has more than 14 years of experience working on software systems performance, with an emphasis on tooling and methodology, Java virtual machines, and BPM middleware. Chris holds a Master of Science degree from the University of Washington in Seattle.



Jing Wen Cui (jcui@vmware.com), Technical Staff Member, VMware

Jing Wen Cui was formerly a Staff Software Engineer at IBM China, where she worked on IBM BPM for more than six years. Her areas of expertise include BPEL, SCA, web service security and attachments, workflow patterns, SAML, and WebSphere Portal. She now works at VMware, where she focuses on cloud-related testing and development.



Eric Herness (herness@us.ibm.com), Distinguished Engineer, IBM

Eric Herness is an IBM Distinguished Engineer and is the Chief Architect for business process management (BPM) in IBM Software Group. Eric is also the CTO for the business unit focused on BPM and operational decision management (ODM), where he leads the architects who define product and technical direction for the business.

Eric has worked with many large customers as they have adopted BPM and ODM approaches. He has had key lead architectural roles in WebSphere for more than 15 years. Eric has an MBA from the Carlson School at the University of Minnesota.



Karri Carlson-Neumann (karricar@us.ibm.com), Senior Software Engineer, IBM

Karri Carlson-Neumann is a Senior Software Engineer on the WebSphere Process Server development team in Rochester, Minnesota. She has been involved with the development of WebSphere Business Integration Server Foundation and WebSphere Process Server for many years. She currently works in a bring-up lab and is focused on the deployment and integration of WebSphere Process Server.



28 August 2013

Introduction

Many IBM Business Process Manager (IBM BPM) customers, especially enterprise customers, need the ability to recover from a disaster. This requires careful planning, a disaster recovery (DR) data center, and a strategy for replicating your data to the DR site. Today's most common and proven approach is to use storage area network (SAN) technology to asynchronously replicate the disks containing the database and the transaction logs, together with a cloned, "cold" cell in the DR site. This approach is documented in the article Asynchronous replication of WebSphere Process Server and WebSphere Enterprise Service Bus for disaster recovery environments. However, some modern BPM solutions require a near zero recovery point objective (RPO) and a recovery time objective (RTO) that is faster than can generally be achieved via the approach described in that article. That is, the DR site needs to be able to get up and running quickly, against very recently replicated data.

Traditional SAN-based asynchronous replication can't provide an aggressive RTO, because establishing service in the recovery environment usually takes a long time due to its "cold start." In the article Using Oracle Database File System for disaster recovery on IBM Business Process Manager, we introduced a disaster recovery solution that uses database-managed replication, which can provide zero or near zero RPO; however, its RTO still includes all of the time required to start up the cold recovery environment. This article introduces a new disaster recovery strategy (called stray node for short) that reduces the RTO.

The basic idea of the stray node approach is to build a cross-data center BPM environment, keeping the node agents in the standby data center running during normal operations while ensuring that the cluster members themselves are stopped. During recovery, only the cluster members in the standby data center have to be started. Where applicable, this approach can deliver an RTO on the order of tens of minutes rather than hours. However, keep in mind that IBM BPM is just one part of the whole environment: even though IBM BPM can be started in minutes, the total recovery time can still be hours or even days if the systems IBM BPM depends on can't be recovered in a timely manner.

The following sections describe how to configure the environment and also common problems and solutions when implementing this strategy.

Although the investigative work described in this article was carried out using IBM BPM V8.0.1, the same concepts can be applied to IBM BPM V8.5 and later versions.

The cross data center deployment environment

The basis of the stray node approach is to split the WebSphere® cell that underpins IBM BPM across data centers. In general, a cross data center cell is not recommended; the rationale is explained in great detail in the column Everything you always wanted to know about WebSphere Application Server but were afraid to ask. As described in that column, extreme care should be taken when designing a cross data center topology. Since the DR data center is expected to be geographically separated from the primary data center, the biggest concern is network latency, which can introduce unexpected results. To make a cross data center deployment environment work despite that latency, you need to minimize cross data center network communication, while still making sure that master configuration changes are propagated to the standby data center during normal deployment manager operations. You can achieve these goals with the following configuration:

  • Making the standby data center's node agents active during normal operations to receive configuration updates, ensuring that configuration changes made in the primary environment are propagated to the secondary environment.
  • Moving the standby data center's node agents into a separate core group in order to allow them to remain active during normal operations without incurring the cost of network communication between the primary and standby node agents.
  • Stopping the standby data center's cluster members during normal processing in order to allow them to remain within the same core group as the primary cluster members without incurring the cost of network communication between the primary and secondary cluster members.

Figure 1 provides an example that illustrates this idea.

Figure 1. Sample solution topology

In the diagram above, you can see that:

  • There are two data centers, generally separated by some distance.
  • There are two IBM BPM nodes in each data center.
  • All four nodes are federated into the same cell.

A typical four cluster deployment environment is created on the four nodes. In the standby environment (during normal processing) only the node agents are started; the cluster members themselves are stopped. There are two core groups in this configuration. Servers inside the purple polygon are in the default core group; servers inside the yellow rectangle are in a separate core group.

Following is a summary of the set-up procedures used to create this environment:

  1. Configure Oracle® Data Guard, Database File System (DBFS) and Network File System (NFS) Server on the database server machine.
  2. Configure the NFS client on all IBM BPM server machines and mount the remote drive.
  3. Install IBM BPM on all BPM machines and create corresponding profiles.
  4. Federate all profiles into the same cell.
  5. Create a deployment environment that maps cluster members on all nodes.
  6. Configure all servers' transaction log directories to point to the NFS mount point.
  7. Create a separate core group and move the standby data center's node agents into it.

This approach avoids or addresses many of the concerns described in the referenced column. This gives us a good basis from which to further explain how to leverage this approach to achieve reduced RTO. The following section describes how to set up the sample environment in detail.

Note that WebSphere does not allow members of the same cluster to be placed in different core groups, which is why only the standby node agents, and not the cluster members, are moved to a separate core group.

Overview of IBM BPM disaster recovery lab environment

The sample lab environment used for this solution consists of two BPM environments:

  • A primary production site
  • A standby disaster recovery site

The two data centers are connected by a WAN. IBM BPM Advanced V8.0.1 is installed on each BPM node. Table 1 shows the details of the software configuration for the primary and standby sites.

Table 1. Software configuration for the primary and standby sites
Site | Host information | Installed components | Other configuration
Primary | OS: Red Hat® Enterprise Linux Server release 6.1; Hostname: rehl217.cn.ibm.com; IP: 9.115.198.217 | Oracle 11g R2 | (none)
Primary | OS: Red Hat Enterprise Linux Server release 6.1; Hostname: rel57.cn.ibm.com; IP: 9.115.198.57 | IBM BPM Advanced V8.0.1 (Deployment Manager node and custom node) | A host alias for the Oracle machine in the hosts file: 9.115.198.217 oracle.cn.ibm.com
Primary | OS: Red Hat Enterprise Linux Server release 6.1; Hostname: rel58.cn.ibm.com; IP: 9.115.198.58 | IBM BPM Advanced V8.0.1 (custom node) | A host alias for the Oracle machine in the hosts file: 9.115.198.217 oracle.cn.ibm.com
Standby | OS: Red Hat Enterprise Linux Server release 6.1; Hostname: rehl218.cn.ibm.com; IP: 9.115.198.218 | Oracle 11g R2 | (none)
Standby | OS: Red Hat Enterprise Linux Server release 6.1; Hostname: rel67.cn.ibm.com; IP: 9.115.198.67 | IBM BPM Advanced V8.0.1 (custom node) | A host alias for the Oracle machine in the hosts file: 9.115.198.218 oracle.cn.ibm.com
Standby | OS: Red Hat Enterprise Linux Server release 6.1; Hostname: rel68.cn.ibm.com; IP: 9.115.198.68 | IBM BPM Advanced V8.0.1 (custom node) | A host alias for the Oracle machine in the hosts file: 9.115.198.218 oracle.cn.ibm.com

As described in the table above, a static hosts file is used to map the same database host name to different IP addresses. If this static host name resolution is not applicable in your environment, another technique can be used: include both the primary and standby Oracle databases' addresses in the Oracle connection string. The following section demonstrates this approach.

Configure the failover-enabled Oracle connection string

After the IBM BPM deployment environment has been configured, you need to update all Oracle data sources with a connection string, as shown in Listing 1, using either the admin console or scripting.

Listing 1. Configure the Oracle connection string
jdbc:oracle:thin:@
(DESCRIPTION=
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)(HOST=rehl217.cn.ibm.com)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=rehl218.cn.ibm.com)(PORT=1521))
    (LOAD_BALANCE=off)
    (FAILOVER=on)
  )
  (CONNECT_DATA=
    (SERVER=DEDICATED)
    (SERVICE_NAME=BPM)
  )
)
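If you prefer scripting to the admin console, a wsadmin (Jython) script can push the new URL into the URL custom property of each Oracle data source. The following minimal sketch is not taken from the article's environment: the host names are the ones from Table 1, and the check on the JDBC provider name is an assumption about how your provider is named, so adapt it before use.

# wsadmin -lang jython
# Sketch: set the failover-enabled URL on the "URL" custom property of every Oracle data source.
newUrl = ('jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST='
          '(ADDRESS=(PROTOCOL=TCP)(HOST=rehl217.cn.ibm.com)(PORT=1521))'
          '(ADDRESS=(PROTOCOL=TCP)(HOST=rehl218.cn.ibm.com)(PORT=1521))'
          '(LOAD_BALANCE=off)(FAILOVER=on))'
          '(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=BPM)))')

for ds in AdminConfig.list('DataSource').splitlines():
    if AdminConfig.showAttribute(ds, 'provider').find('Oracle') == -1:
        continue                                  # assumption: Oracle JDBC providers have "Oracle" in their name
    propSet = AdminConfig.showAttribute(ds, 'propertySet')
    if not propSet:
        continue
    for prop in AdminConfig.list('J2EEResourceProperty', propSet).splitlines():
        if AdminConfig.showAttribute(prop, 'name') == 'URL':
            AdminConfig.modify(prop, [['value', newUrl]])

AdminConfig.save()

After saving the change, synchronize the nodes and restart the servers so that the new connection string takes effect.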

After updating data sources, you must create a dedicated Oracle service. Use the commands in Listing 2 to create a service in the primary database. The service will then be created automatically in the standby database during Oracle Data Guard configuration.

Listing 2. Create and start the Oracle service
exec DBMS_SERVICE.CREATE_SERVICE('BPM','BPM');
exec DBMS_SERVICE.START_SERVICE('BPM');

To start the service automatically each time the database comes up, create an additional trigger, as shown in Listing 3. Issue the command on the primary database; the trigger will then be created automatically on the standby database.

Listing 3. Create start-up trigger
CREATE OR REPLACE TRIGGER START_SERVICES AFTER STARTUP ON DATABASE
DECLARE
  ROLE VARCHAR(30);
BEGIN
  SELECT DATABASE_ROLE INTO ROLE FROM V$DATABASE;
  IF ROLE = 'PRIMARY' THEN
    DBMS_SERVICE.START_SERVICE('BPM');
  END IF;
END;

In this approach, when recovering the standby site, there is no need to change the WebSphere data source connection string to point to the standby database. The Oracle JDBC driver will automatically fail over to the standby database during runtime.

Oracle Real Application Clusters (RAC) with Single Client Access Name (SCAN)

IBM BPM can work with Oracle RAC through either a SCAN address or a normal failover connection string. If using a SCAN address, the primary and standby environments' DNS server must be configured to resolve the SCAN address to different IP addresses.


Set-up procedures

Refer to our article Using Oracle Database File System for disaster recovery on IBM Business Process Manager for detailed steps for setting up Oracle DBFS, NFS, and IBM BPM. This section focuses solely on the additional IBM BPM configuration required to enable the stray node approach.

Configure IBM BPM

Install the IBM BPM binaries on the four BPM machines. Create a deployment manager profile and a custom profile on the first machine (rel57.cn.ibm.com, in our case), and custom profiles on the other three machines. Then federate the four nodes into the deployment manager's cell. Finally, create a deployment environment and make sure it uses all four nodes, as shown in Figure 2.

Figure 2. Deployment environment wizard

Once the deployment environment has been generated successfully, configure each server's transaction log directory to point to a directory under the NFS mount point.
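If you want to script this step, the transaction log directory is the transactionLogDirectory attribute of each server's transaction service. The sketch below is a minimal example under assumptions of ours: the /mnt/dbfs/tranlog mount point and the per-server subdirectory naming are illustrative only. The compensation service's recovery log directory can be redirected to the shared mount in the same way through the admin console.

# wsadmin -lang jython
# Sketch: point each cluster member's transaction log at a unique directory under the shared NFS/DBFS mount.
logRoot = '/mnt/dbfs/tranlog'                      # assumed mount point; adjust for your environment

for server in AdminConfig.list('Server').splitlines():
    name = AdminConfig.showAttribute(server, 'name')
    if name == 'dmgr' or name == 'nodeagent':
        continue                                   # only configure cluster members
    nodeName = server.split('/nodes/')[1].split('/')[0]
    txService = AdminConfig.list('TransactionService', server)
    if txService:
        AdminConfig.modify(txService,
                           [['transactionLogDirectory', logRoot + '/' + nodeName + '_' + name]])

AdminConfig.save()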

Next, you can continue to configure another core group.

Configure a core group

All servers except the deployment manager should be stopped before you configure the core groups.

Log in to the admin console, select Core Groups => Core group settings, then click New, as shown in Figure 3.

Figure 3. New core group

In the new core group dialog, shown in Figure 4, specify a unique value for the new core group name (in this example, StandbyCoreGroup). Accept the default values for the other properties, or refer to the WebSphere Application Server Information Center to get a full understanding of those settings and then set them appropriately for your business scenario. Click OK and save your changes.

Figure 4. Core group configuration

Navigate back to the Core Group dialog, and select DefaultCoreGroup. In the detailed information dialog, under Additional properties, click Core group servers. In the Core Group Servers dialog, shown in Figure 5, select the standby data center's node agents and click Move.

Figure 5. Move standby node agents

In the next screen, shown in Figure 6, select the new core group you created, click OK, and then save the configuration.

Figure 6. Move node agents to another core group

Finally, start the primary and standby data centers' node agents and perform a full synchronization. Restart both the deployment manager and the node agents to make the configuration take effect.
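The same configuration can also be scripted. The following wsadmin (Jython) sketch assumes the CoreGroupManagement administrative commands available in WebSphere Application Server V8.x and uses placeholder standby node names (rel67Node01 and rel68Node01); it creates the new core group, moves the standby node agents into it, and then triggers a synchronization on every running node agent.

# wsadmin -lang jython
# Sketch: create the standby core group and move the standby node agents into it.
# The node agents must be stopped while they are moved.
AdminTask.createCoreGroup('[-coreGroupName StandbyCoreGroup]')

for nodeName in ['rel67Node01', 'rel68Node01']:    # assumed standby node names
    AdminTask.moveServerToCoreGroup('[-source DefaultCoreGroup -target StandbyCoreGroup '
                                    '-nodeName ' + nodeName + ' -serverName nodeagent]')
AdminConfig.save()

# After the node agents have been started again on each machine (startNode command),
# trigger a synchronization of every node from the deployment manager:
for sync in AdminControl.queryNames('type=NodeSync,*').splitlines():
    AdminControl.invoke(sync, 'sync')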

Verify the configuration

After all configuration is complete, you need to make sure the deployment environment works properly at either the primary or the standby site. You can use the following procedures to verify the whole environment.

Make sure that the standby data center's node agents are always started, that process applications have been deployed to the deployment environment, and that all nodes are synchronized.
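A quick way to check this from wsadmin (Jython) is to query the NodeSync MBean that each running node agent exposes; a minimal sketch:

# wsadmin -lang jython
# Every node whose node agent is running exposes a NodeSync MBean; ask each one whether
# the node's configuration is synchronized with the deployment manager.
for sync in AdminControl.queryNames('type=NodeSync,*').splitlines():
    nodeName = [p for p in sync.split(',') if p.startswith('node=')][0][len('node='):]
    print nodeName + ' synchronized: ' + AdminControl.invoke(sync, 'isNodeSynchronized')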

Verify core group configuration

As described earlier, the key point of this disaster recovery strategy is to avoid network communication between the primary and standby data centers, so you need to verify that the core group isolation achieves this goal. You can do this by monitoring the network traffic between the primary and standby data centers. Because communication between core group members goes through the DCS_UNICAST_ADDRESS port, you can monitor whether any messages are exchanged through this port on the standby data center's node agent servers. To do that, make sure that the node agents at the standby data center are started and that the primary data center's deployment manager and node agents are started as well.

If there are no messages exchanged, it means the configuration is correct.

Refer to core group transports in the WebSphere Application Server Information Center to understand the network communication among core group members.
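To find the port to watch, you can read the DCS_UNICAST_ADDRESS endpoint of the standby node agents from the configuration. The following wsadmin (Jython) sketch uses placeholder standby node names; once you have the port, a packet capture on the standby machines should show no DCS traffic to or from the primary data center while both sites' node agents are running.

# wsadmin -lang jython
# Sketch: print the DCS_UNICAST_ADDRESS port of each standby node agent.
standbyNodes = ['rel67Node01', 'rel68Node01']      # assumed standby node names

for entry in AdminConfig.list('ServerEntry').splitlines():
    if AdminConfig.showAttribute(entry, 'serverName') != 'nodeagent':
        continue
    nodeName = entry.split('/nodes/')[1].split('|')[0]
    if nodeName not in standbyNodes:
        continue
    for ep in AdminConfig.showAttribute(entry, 'specialEndpoints')[1:-1].split(' '):
        if not ep:
            continue
        if AdminConfig.showAttribute(ep, 'endPointName') == 'DCS_UNICAST_ADDRESS':
            port = AdminConfig.showAttribute(AdminConfig.showAttribute(ep, 'endPoint'), 'port')
            print nodeName + ' nodeagent DCS_UNICAST_ADDRESS port: ' + port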

Verify the primary data center configuration

To verify the primary data center configuration, do the following:

  1. Start the primary data center's cluster members in order (Messaging => Support => Application => Web), using either the admin console or scripts.
  2. Start some process instances and verify that end-to-end business processes can be completed successfully. Then, make sure there are running process instances (waiting for human interaction, external events, or expiration).

If business processes can be completed successfully, you've verified that the primary data center's configuration is correct.
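Starting the members in the order given in step 1 can also be scripted. The sketch below is illustrative only: the node and server names are placeholders for the member names your deployment environment generated, and it starts only the primary data center's members so that the standby members stay stopped.

# wsadmin -lang jython
# Sketch: start only the primary data center's cluster members, in the required order
# (messaging, then support, application, and web). The node agents must already be running.
primaryMembers = [
    ('rel57Node01', 'MECluster.member1'),      ('rel58Node01', 'MECluster.member2'),
    ('rel57Node01', 'SupportCluster.member1'), ('rel58Node01', 'SupportCluster.member2'),
    ('rel57Node01', 'AppCluster.member1'),     ('rel58Node01', 'AppCluster.member2'),
    ('rel57Node01', 'WebCluster.member1'),     ('rel58Node01', 'WebCluster.member2'),
]

for nodeName, serverName in primaryMembers:
    # startServer asks the node agent on nodeName to start the server and waits for it
    print AdminControl.startServer(serverName, nodeName)

The same script, with the standby node and member names substituted, can be reused when recovering the standby data center.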

Verify the standby data center configuration

Before taking any actions on the standby data center, make sure the entire BPM environment in the primary data center (including cluster members, node agents, and the deployment manager) is stopped. Because the purpose of this testing exercise is to verify the configuration (as opposed to simulating a disaster), you can stop the primary environment gracefully.

To verify the configuration of the standby data center, do the following:

  1. On the database machine, switch the Oracle role from standby to primary.
  2. Start the standby cluster members in order (Messaging => Support => Application => Web).
  3. Verify that existing running process instances can be completed successfully.
  4. Verify that new instances can be started and completed successfully.

If each of these steps can be executed successfully, you have verified that the whole environment has been configured properly. In a production environment, those steps can be automated to reduce the recovery time and the chance of human errors.


Test the recovery procedure

The recovery procedure is used in the event of a catastrophic failure of the primary data center. In this section, we'll verify that this configuration allows continuation of business processing across a catastrophic failure of the primary data center, and we will measure the amount of time taken to enable processing in the standby data center (RTO).

Prepare for disaster recovery simulation

Before disaster simulation, you need to simulate normal processing in the IBM BPM environment by preparing two sets of instances:

  • BPEL and BPMN instances, used to verify that IBM BPM can continue to work with those instances correctly after the disaster.
  • A variety of workloads (both long-running BPEL and BPMN), used to make sure there are business process instances in different transaction phases when the simulated disaster happens, so that you can verify that the BPM server handles those transactions correctly after the disaster. Based on the capacity of the test machines used in our demonstration, we used 100 concurrent users to start long-running BPEL and BPD instances.

The detailed steps to prepare the instances are:

  1. Generate BPMN process instances that pause at a variety of types of activities (such as a human task inside a linked process).
  2. Generate failed/suspended/terminated BPMN instances.
  3. Generate long-running BPEL instances that stop at a standalone human task.
  4. Generate failed/terminated BPEL instances.
  5. Generate load on the servers in the primary data center (in our testing, we used JMeter).

Simulate disaster

Once an appropriate business load has been applied, shut down the entire primary data center environment to simulate a catastrophic failure.

Recover the standby site

After the disaster has been recognized, the standby site needs to take over to recover the BPM system. The detailed steps to achieve this are:

  1. Convert the standby database to a primary database.
  2. Back up the database (optional).
  3. Launch dbfs_client to export the DBFS mount point on the NFS server.
  4. Restart the NFS server.
  5. Mount the DBFS mount point on each BPM machine.
  6. Start the cluster members in order (Messaging => Support => Application => Web).
  7. (The WebSphere Transaction Service automatically starts recovery during server start-up; no manual action is needed.)

Verify the recovery

After recovery is complete, verify that the solution works well by doing the following:

  1. Verify that BPEL and BPMN instances can be continued successfully.
  2. Verify that new BPEL and BPMN instances can be started successfully.
  3. For in-doubt transactions, check the servers' SystemOut.log file for information indicating which transactions were recovered, meaning they were either committed or rolled back successfully.

The recovery timeline

According to our testing, we can recover the standby data center within 15 minutes, as shown in Figure 7. Of course, this applies only to IBM BPM itself. Actual data center recovery involves not only BPM, but also other business systems as well. Therefore, recovery times depend upon the details of the installation and will vary from one environment to another.

Figure 7. Recovering the standby data center

Standby data center planning

As shown in Figure 1, the standby data center has the same capacity as the primary data center. In this configuration, when a disaster occurs, the standby data center can fully take over the primary data center's workload. Depending on business requirements, the standby data center could contain less total capacity than the primary data center, especially if the standby data center is only intended to take over critical business operations. You need to carefully plan for the standby data center's capacity before implementing this disaster recovery strategy.


Common pitfalls and solutions

This hot/warm disaster recovery strategy presents some technical challenges that are not present when using a traditional active-cold disaster recovery solution. Therefore, you should take special care when implementing this disaster recovery strategy. This section describes some of those pitfalls and their corresponding solutions.

Database server time zones

If the distance between the primary and secondary data centers is large, it's possible that they will be located in different time zones. In this case, it may be tempting to configure the two database servers with different time zones. However, this configuration is not supported in IBM BPM V8.0.1: the same time zone must be configured on both database servers. This limitation is documented in the IBM Business Process Manager Information Center.

Messaging engine cluster member start-up

In a three- or four-cluster deployment environment (the topologies with remote messaging clusters), the messaging engine cluster members must be started before the members of the other clusters. If there is only one messaging engine cluster member at the standby site, you may notice that the messaging server can't start until another server in the same core group is started, and that the SystemOut.log file repeatedly reports the following information:

Listing 4. Messaging engine server startup
[11/9/12 16:52:24:809 GMT+08:00] 00000007 RLSHAGroupCal W   CWRLS0030W: Waiting for HAManager 
to activate recovery processing for local WebSphere server

This is a known issue in WebSphere Application Server. In order to avoid this issue, ensure that at least two messaging engine cluster members are present in the standby data center. Usually, only one of the messaging engine cluster members will be active at a time.

Cross-cell Enterprise JavaBeans (EJB) look-up failure

When the standby site takes over, name resolution for EJB look-ups may fail. If the application contains an EJB client that looks up BPM EJBs implemented in another WebSphere cell, the look-up may fail, as documented in the IBM Redbooks® publication Techniques for Managing Large WebSphere Installations (page 67). For applications that make use of this type of EJB interaction (which is not common), the problem can be avoided by creating an additional node agent in the standby data center, assigning it to the same core group as the cluster members and node agents in the primary data center, and creating a core group bridge between the two core groups. To avoid cross-site network traffic during normal operations, this additional node agent is not started until after a disaster is detected. Figure 8 illustrates this solution.

Figure 8. EJB lookup failure solution

To create the core group bridge, complete the following steps:

  1. In the Core group bridge settings page on WebSphere admin console, select Access point groups, as shown in Figure 9.
    Figure 9. Core group bridge settings
  2. Select DefaultAccessPointGroup => Core group access points, then select CGAP_1\DefaultCoreGroup, and click Show Details, as shown in Figure 10.
    Figure 10. Core group access points
  3. Under Additional Properties, click Bridge interface and then click New, as shown in Figure 11.
    Figure 11. Access point
  4. In the Bridge interfaces drop-down list, select the standalone node agent and then click OK, as shown in Figure 12.
    Figure 12. Bridge interface
  5. Return to the Core group access points dialog (Figure 10), and repeat the same steps for the other core group, making sure to select the standby data center's node agent as the bridge interface.

Figure 13 shows the final result after saving and restarting the whole environment.

Figure 13. Access point groups

During normal processing, the standalone node agent is stopped. When recovering the standby data center, this node agent must be started before starting other cluster members.


Additional considerations

Process Center:
In the context of an overall solution topology, the concepts described in this article apply to Process Server cells and should be applied to Process Center cells as well. Remember that, in order to maintain or make changes to process applications that are under the control of Process Center, the Process Center cell also needs to be recovered and made operational at the recovery site.
Deployment manager:
Please note that there is no deployment manager at the standby site; this is different from the typical disaster recovery scenarios that have been described for IBM BPM in the past. To accommodate this, either file-based replication or a recovery node technique can be used to make the deployment manager available at the standby site. The article Run time management high availability options, redux provides details on different approaches to making the deployment manager highly available.

Profile backup and restore can be applied as well, but compared with the file replication or recovery node approaches, it will take more time to recover the deployment manager at the standby site. In general, depending on your IT infrastructure and your business requirements for the deployment manager's recovery time, choose the approach that best suits your environment.

Variations on the strategy:
There is another variation that involves applying the stray node concept while still leveraging SAN replication. While this is possible, it is not the subject of this article; you can contact the authors directly if you are interested in this combination. Our general point of view is that while this should work, it will not provide as great an improvement in RTO as the database-based approach to data replication, because of the number and length of the recovery procedures required.

Conclusion

This article described a useful additional topology for enterprise solutions that require reduced recovery times (RTO) for disaster recovery. We described the "stray node" concept and showed how properly setting up and configuring the primary and recovery sites allows this concept to enable recovery that is faster than previously documented methods.
