Faster disaster recovery in IBM Business Process Manager
Implementing a multiple data center approach to improve recovery time
Many IBM Business Process Manager (IBM BPM) customers, especially enterprise customers, need the ability to recover from a disaster. This requires careful planning, a disaster recovery (DR) data center, and a strategy for replicating your data to the DR site. Today's most common and proven approach is to use storage area network (SAN) technology to asynchronously replicate the disks containing the database and the transaction logs, together with a cloned, "cold" cell in the DR site. This approach is documented in the article Asynchronous replication of WebSphere Process Server and WebSphere Enterprise Service Bus for disaster recovery environments. However, some modern BPM solutions require a near zero recovery point objective (RPO) and a recovery time objective (RTO) that is shorter than can generally be achieved with the approach described in that article. That is, the DR site needs to be able to get up and running quickly, against very recently replicated data.
Traditional SAN-based asynchronous replication can't provide an aggressive RTO because establishing service in the recovery environment usually takes a long time due to its "cold start." In the article Using Oracle Database File System for disaster recovery on IBM Business Process Manager, we introduced a disaster recovery solution that uses database-managed replication, which can provide zero or near zero RPO; however, its RTO still includes all of the time required to start up the cold recovery environment. This article introduces a new disaster recovery strategy (called stray node for short) to reduce the RTO.
The basic idea of stray node is to build a cross-data center BPM environment, keeping node agents in the standby data center running during normal operations while ensuring that the standby cluster members themselves are stopped. During recovery, only the cluster members in the standby data center have to be started. Where applicable, this approach could be used to deliver an RTO on the order of tens of minutes rather than hours. However, keep in mind that IBM BPM is just one part of the whole environment: even though IBM BPM can be started in minutes, the total recovery time can still be hours or days if the systems IBM BPM depends on can't be recovered in a timely manner.
The following sections describe how to configure the environment and also common problems and solutions when implementing this strategy.
The cross data center deployment environment
The basis of the stray node approach is to split the WebSphere® cell that underpins IBM BPM across data centers. In general, a cross data center cell is not recommended. The rationale is explained in great detail in the column Everything you always wanted to know about WebSphere Application Server but were afraid to ask. As described in that column, extreme care should be taken when designing a cross data center topology. Because the DR data center is expected to be geographically separated from the primary data center, the biggest concern is network latency, which could introduce unexpected results. To make a cross data center deployment environment work, you therefore need to minimize cross data center network communications. You also need to make sure that master configuration changes are propagated to the standby data center during normal deployment manager operations. You can achieve these goals by using the following configurations:
- Making the standby data center's node agents active during normal operations to receive configuration updates, ensuring that configuration changes made in the primary environment are propagated to the secondary environment.
- Moving the standby data center's node agents into a separate core group in order to allow them to remain active during normal operations without incurring the cost of network communication between the primary and standby node agents.
- Stopping the standby data center's cluster members during normal processing in order to allow them to remain within the same core group as the primary cluster members without incurring the cost of network communication between the primary and secondary cluster members.
Figure 1 provides an example that illustrates this idea.
Figure 1. Sample solution topology
In the diagram above, you can see that:
- There are two data centers, generally separated by some distance.
- There are two IBM BPM nodes in each data center.
- All four nodes are federated into the same cell.
A typical four cluster deployment environment is created on the four nodes. In the standby environment (during normal processing) only the node agents are started; the cluster members themselves are stopped. There are two core groups in this configuration. Servers inside the purple polygon are in the default core group; servers inside the yellow rectangle are in a separate core group.
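The rules that this topology relies on during normal operations can be written down as a small check. The following is a minimal Python sketch, not part of the product: the `Server` record, the server names, and the `StandbyCoreGroup` name are hypothetical, and the inventory would have to be supplied from your own environment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Server:
    name: str
    site: str        # "primary" or "standby"
    kind: str        # "dmgr", "nodeagent", or "member" (cluster member)
    core_group: str
    running: bool

def check_normal_operation(servers, default_core_group="DefaultCoreGroup"):
    """Return a list of violations of the stray node rules for normal operations."""
    problems = []
    for s in servers:
        if s.site != "standby":
            continue
        if s.kind == "nodeagent":
            # Standby node agents stay up to receive configuration updates...
            if not s.running:
                problems.append(f"{s.name}: standby node agent should be running")
            # ...but live in their own core group to avoid cross-site traffic.
            if s.core_group == default_core_group:
                problems.append(f"{s.name}: standby node agent should not be in {default_core_group}")
        elif s.kind == "member" and s.running:
            # Standby cluster members share the default core group, so they must be stopped.
            problems.append(f"{s.name}: standby cluster member should be stopped")
    return problems

# Hypothetical inventory: names and the StandbyCoreGroup label are illustrative only.
inventory = [
    Server("dmgr", "primary", "dmgr", "DefaultCoreGroup", True),
    Server("nodeagent-p1", "primary", "nodeagent", "DefaultCoreGroup", True),
    Server("AppTarget-p1", "primary", "member", "DefaultCoreGroup", True),
    Server("nodeagent-s1", "standby", "nodeagent", "StandbyCoreGroup", True),
    Server("AppTarget-s1", "standby", "member", "DefaultCoreGroup", False),
]

for problem in check_normal_operation(inventory):
    print(problem)
```

A check like this is only as good as the inventory it is given; it is meant to make the three rules explicit, not to replace inspection of the cell itself.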
Following is a summary of the set-up procedures used to create this environment:
- Configure Oracle® Data Guard, Database File System (DBFS) and Network File System (NFS) Server on the database server machine.
- Configure the NFS client on all IBM BPM server machines and mount the remote drive.
- Install IBM BPM on all BPM machines and create corresponding profiles.
- Federate all profiles into the same cell.
- Create a deployment environment that maps cluster members on all nodes.
- Configure all servers' transaction logs to the NFS point.
- Create a separate core group and move the standby data center's node agents into it.
This approach avoids or addresses many of the concerns described in the referenced column. This gives us a good basis from which to further explain how to leverage this approach to achieve reduced RTO. The following section describes how to set up the sample environment in detail.
Overview of IBM BPM disaster recovery lab environment
The sample lab environment used for this solution consists of two BPM environments:
- A primary production site
- A standby disaster recovery site
The two data centers are connected by a WAN. IBM BPM Advanced V8.0.1 is installed on each node. Table 1 shows the details of the software configuration for the primary and standby sites.
Table 1. Software configuration for the primary and standby sites
|Site||Host information||Installed components||Other configuration|
|Primary||OS: Red Hat® Enterprise Linux Server release 6.1 Hostname: rehl217.cn.ibm.com IP: 18.104.22.168||Oracle 11g R2|
|Primary||OS: Red Hat Enterprise Linux Server release 6.1 Hostname: rel57.cn.ibm.com IP: 22.214.171.124||BPM Advanced Edition v8.0.1 (Deployment Manager node and Custom Node)||A host alias name is used for oracle machine in hosts file: 126.96.36.199 oracle.cn.ibm.com|
|Primary||OS: Red Hat Enterprise Linux Server release 6.1 Hostname: rel58.cn.ibm.com IP: 188.8.131.52||BPM Advanced Edition v8.0.1 (Custom Node)||A host alias name is used for oracle machine in hosts file: 184.108.40.206 oracle.cn.ibm.com|
|Standby||OS: Red Hat Enterprise Linux Server release 6.1 Hostname: rehl218.cn.ibm.com IP: 220.127.116.11||Oracle 11g R2|
|Standby||OS: Red Hat Enterprise Linux Server release 6.1 Hostname: rel67.cn.ibm.com IP: 18.104.22.168||IBM BPM Advanced V8.0.1 (Custom Node)||A host alias name is used for the Oracle machine in the hosts file: 22.214.171.124 oracle.cn.ibm.com|
|Standby||OS: Red Hat Enterprise Linux Server release 6.1 Hostname: rel68.cn.ibm.com IP: 126.96.36.199||IBM BPM Advanced V8.0.1 (Custom Node)||A host alias name is used for the Oracle machine in the hosts file: 188.8.131.52 oracle.cn.ibm.com|
As described in the table above, a static hosts file is used to map the same database host name to different IP addresses. If this static host name resolution is not applicable, another technique can be used. The basic idea is to include both the primary and standby Oracle databases' IP addresses in the Oracle connection string. The following section demonstrates this approach.
Configure the failover-enabled Oracle connection string
After the IBM BPM deployment environment has been configured, you need to update all Oracle data sources with a connection string, as shown in Listing 1, using either the admin console or scripting.
Listing 1. Configure the Oracle connection string
jdbc:oracle:thin:@(DESCRIPTION= (ADDRESS_LIST= (ADDRESS=(PROTOCOL=TCP)(HOST=rehl217.cn.ibm.com)(PORT=1521)) (ADDRESS=(PROTOCOL=TCP)(HOST=rehl218.cn.ibm.com)(PORT=1521)) (LOAD_BALANCE=off) (FAILOVER=on) ) (CONNECT_DATA= (SERVER=DEDICATED) (SERVICE_NAME=BPM) ) )
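Because this string has to be entered for every Oracle data source, it can be worth generating it rather than typing it. The following is a minimal Python sketch that assembles a string of the same shape and checks that its parentheses balance. The function names are hypothetical; the host names are the primary and standby Oracle machines from Table 1, and the port and service name are the values used in Listing 1.

```python
def build_failover_url(hosts, service_name, port=1521):
    """Build a thin-driver URL that tries each host in order (no load balancing)."""
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))" for host in hosts
    )
    return (
        "jdbc:oracle:thin:@(DESCRIPTION="
        f"(ADDRESS_LIST={addresses}(LOAD_BALANCE=off)(FAILOVER=on))"
        f"(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME={service_name})))"
    )

def parens_balanced(text):
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

# Primary database host first, standby database host second.
url = build_failover_url(["rehl217.cn.ibm.com", "rehl218.cn.ibm.com"], "BPM")
print(url)
print("balanced:", parens_balanced(url))
```

The order of the hosts matters here: with load balancing off, the first address is expected to be tried first, so the primary database host should come first.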
After updating data sources, you must create a dedicated Oracle service. Use the commands in Listing 2 to create a service in the primary database. The service will then be created automatically in the standby database during Oracle Data Guard configuration.
Listing 2. Create and start oracle service
exec DBMS_SERVICE.CREATE_SERVICE('BPM','BPM');
exec DBMS_SERVICE.START_SERVICE('BPM');
In order to start the service each time a database comes up, you need to create an additional trigger, as shown in Listing 3. Issue the command on the primary database, then the trigger will be created automatically on the standby database.
Listing 3. Create start-up trigger
CREATE OR REPLACE TRIGGER START_SERVICES AFTER STARTUP ON DATABASE
DECLARE
  ROLE VARCHAR(30);
BEGIN
  SELECT DATABASE_ROLE INTO ROLE FROM V$DATABASE;
  IF ROLE = 'PRIMARY' THEN
    DBMS_SERVICE.START_SERVICE('BPM');
  END IF;
END;
In this approach, when recovering the standby site, there is no need to change the WebSphere data source connection string to point to the standby database. The Oracle JDBC driver will automatically fail over to the standby database during runtime.
Oracle Real Application Clusters (RAC) with Single Client Access Name (SCAN)
IBM BPM can work with Oracle RAC through either a SCAN address or a normal failover connection string. If using a SCAN address, the primary and standby environments' DNS server must be configured to resolve the SCAN address to different IP addresses.
Refer to our article Using Oracle Database File System for disaster recovery on IBM Business Process Manager for detailed steps for setting up Oracle DBFS, NFS and BPM. This section focuses solely on the additional IBM BPM configuration required to enable the stray node approach.
Configure IBM BPM
Install the IBM BPM binaries on the four BPM machines. Create a deployment manager profile and a custom profile on the first machine (rel57.cn.ibm.com, in our case), and custom profiles on the other three machines. Then federate the four nodes into the deployment manager. Finally, create a deployment environment and make sure it uses the four nodes, as shown in Figure 2.
Figure 2. Deployment environment wizard
Once the deployment environment has been generated successfully for all servers, configure transaction logs for each server into a directory contained on the NFS mount point.
Next, you can continue to configure another core group.
Configure a core group
Log in to the admin console, select Core Groups => Core group settings, then click New, as shown in Figure 3.
Figure 3. New core group
In the New Group dialog, shown in Figure 4, specify a unique value for the new core group name. Accept the default values for other properties. (Or, refer to the WebSphere Application Server Information Center to get a full understanding of those settings and then set them appropriately for your business scenario.) Click OK and save your changes.
Figure 4. Core group configuration
Navigate back to the Core Group dialog, and select DefaultCoreGroup. In the detailed information dialog, under Additional properties, click Core group servers. In the Core Group Servers dialog, shown in Figure 5, select the standby data center's node agents and click Move.
Figure 5. Move standby node agents
In the next screen, shown in Figure 6, select the new core group you created, click OK, and then save the configuration.
Figure 6. Move node agents to another core group
Finally, start the primary and standby data centers' node agents and do a full synchronization. Restart both the deployment manager and the node agents to make the configuration take effect.
Verify the configuration
After all configurations have been done, you need to make sure the deployment environment can work properly in either the primary or standby site. You can use the following procedures to verify the whole environment.
Verify core group configuration
As described earlier, the key point of this disaster recovery strategy is to avoid network communication between the primary and standby data centers. You need to verify that the core group isolation achieves this goal. You can do this by monitoring the network traffic between the primary and standby data centers. Because core group members communicate through the DCS_UNICAST_ADDRESS port, you can monitor whether any messages are exchanged through this port on the standby data center's node agent servers. To do that, make sure that the node agents have been started at the standby data center and that the primary data center's deployment manager and node agents are started as well.
If there are no messages exchanged, it means the configuration is correct.
Refer to core group transports in the WebSphere Application Server Information Center to understand the network communication among core group members.
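One simple way to look for such traffic is to list the TCP connections on a standby node agent machine and flag any that use the DCS port and have a peer in the primary data center. The following is a minimal Python sketch that assumes the connection list comes from the Linux `ss -tn` command. The port number 9353, the other port values, and all IP addresses below are hypothetical placeholders; look up the actual DCS_UNICAST_ADDRESS port of your node agent in the admin console.

```python
def cross_site_dcs_connections(ss_lines, dcs_port, primary_ips):
    """Return (local, peer) pairs that use the DCS port and reach a primary-site host."""
    hits = []
    for line in ss_lines:
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
        if len(fields) < 5 or fields[0] == "State":
            continue
        _, local_port = fields[3].rsplit(":", 1)
        peer_addr, peer_port = fields[4].rsplit(":", 1)
        uses_dcs = str(dcs_port) in (local_port, peer_port)
        if uses_dcs and peer_addr in primary_ips:
            hits.append((fields[3], fields[4]))
    return hits

# Hypothetical capture taken on a standby node agent machine.
sample = [
    "State  Recv-Q Send-Q Local Address:Port   Peer Address:Port",
    "ESTAB  0      0      198.51.100.7:9353    198.51.100.8:51544",  # standby to standby
    "ESTAB  0      0      198.51.100.7:40112   192.0.2.10:8879",     # not the DCS port
]
primary_hosts = {"192.0.2.10", "192.0.2.11"}

violations = cross_site_dcs_connections(sample, 9353, primary_hosts)
print("cross-site DCS connections:", violations)
```

A snapshot of established connections only shows one moment in time. Capturing packets on the port over a longer period gives stronger evidence that no core group traffic crosses the sites.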
Verify the primary data center configuration
To verify the primary data center configuration, do the following:
- Start the primary data center's cluster members in order using either the admin console or scripts. (Messaging => Support => Application => Web).
- Start some process instances and verify that end-to-end business processes can be completed successfully. Then, make sure there are running process instances (waiting for human interaction, external events, or expiration).
If business processes can be completed successfully, you've verified that the primary data center's configuration is correct.
Verify the standby data center configuration
To verify the configuration of the standby data center, do the following:
- On the database machine, switch the Oracle role from standby to primary.
- Start the standby cluster members in order (Messaging => Support => Application => Web).
- Verify that existing running process instances can be completed successfully.
- Verify that new instances can be started and completed successfully.
If each of these steps can be executed successfully, you have verified that the whole environment has been configured properly. In a production environment, these steps can be automated to reduce the recovery time and the chance of human error.
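As an illustration of such automation, the following minimal Python sketch enforces the start order used above (Messaging, Support, Application, Web) and waits for each cluster before starting the next. It is a sketch under stated assumptions: `start_cluster` and `is_running` are hypothetical callables that you would implement with your own administrative scripting, and the fake versions below exist only so the example runs on its own.

```python
import time

START_ORDER = ["Messaging", "Support", "Application", "Web"]

def start_in_order(start_cluster, is_running, order=START_ORDER,
                   timeout_s=600.0, poll_s=5.0, sleep=time.sleep):
    """Start clusters one at a time, waiting for each before starting the next."""
    started = []
    for cluster in order:
        start_cluster(cluster)
        waited = 0.0
        while not is_running(cluster):
            if waited >= timeout_s:
                raise TimeoutError(f"{cluster} cluster did not start within {timeout_s}s")
            sleep(poll_s)
            waited += poll_s
        started.append(cluster)
    return started

# Stand-ins for real administrative calls, so the sketch is runnable as is.
requested = []

def fake_start(cluster):
    requested.append(cluster)

def fake_is_running(cluster):
    return cluster in requested

print(start_in_order(fake_start, fake_is_running, sleep=lambda seconds: None))
```

Stopping at the first cluster that fails to start is deliberate: starting the Application cluster before the Messaging cluster is available would only hide the real problem.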
Test the recovery procedure
The recovery procedure is used in the event of a catastrophic failure of the primary data center. In this section, we'll verify that this configuration allows continuation of business processing across a catastrophic failure of the primary data center, and we will measure the amount of time taken to enable processing in the standby data center (RTO).
Prepare for disaster recovery simulation
Before disaster simulation, you need to simulate normal processing in the IBM BPM environment by preparing two sets of instances:
- BPEL and BPMN instances, for verification that BPM can perform well on those instances after disaster.
- A variety of workloads (both long-running BPEL and BPMN), to make sure there are business process instances in different transaction phases when the simulated disaster happens, for verification that the BPM server can deal with those transactions correctly after the disaster. Based on the capacity of the test machine used in our demonstration, we used 100 concurrent users to start long-running BPEL/BPD instances concurrently.
The detailed steps to prepare the instances are:
- Generate BPMN process instances that pause at a variety of types of activity (such as human task inside linked process).
- Generate failed/suspended/terminated BPMN instances.
- Generate long-running BPEL instances that stop at a standalone human task.
- Generate failed/terminated BPEL instances.
- In our testing, we used JMeter to generate load on the servers in the primary data center.
Once an appropriate business load is simulated, you shut down the entire primary data center environment to simulate a catastrophic failure.
Recover the standby site
After the disaster has been recognized, the standby site needs to take over to recover the BPM system. The detailed steps to achieve this are:
- Convert the standby database to a primary database.
- Back up the database (optional).
- Launch dbfs_client to export the DBFS point in the NFS server.
- Restart the NFS server.
- Mount the DBFS point in each BPM machine.
- Start the cluster members in order (Messaging => Support => Application => Web).
- (The WebSphere Transaction Service will automatically start recovery during server start-up.)
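Because the purpose of this test is to measure the RTO, it helps to record how long each of these steps takes. The following is a minimal Python sketch of a timed runbook. The step actions are empty placeholders and the function names are hypothetical; in practice each action would invoke your own database, NFS, or server start-up script.

```python
import time

def run_runbook(steps, clock=time.monotonic):
    """Run (name, action) steps in order and return [(name, seconds), ...].

    An exception raised by a step stops the run, so later steps never
    execute against a half-recovered environment.
    """
    timeline = []
    for name, action in steps:
        began = clock()
        action()
        timeline.append((name, clock() - began))
    return timeline

def total_seconds(timeline):
    """Sum the measured step durations."""
    return sum(seconds for _, seconds in timeline)

def placeholder():
    """Stands in for a real recovery script."""

recovery_steps = [
    ("Convert the standby database to primary", placeholder),
    ("Launch dbfs_client and restart the NFS server", placeholder),
    ("Mount the DBFS point on each BPM machine", placeholder),
    ("Start the cluster members in order", placeholder),
]

timeline = run_runbook(recovery_steps)
for name, seconds in timeline:
    print(f"{seconds:8.1f}s  {name}")
print(f"{total_seconds(timeline):8.1f}s  total")
```

The per-step breakdown shows where the recovery time is actually spent, which is usually more useful for tuning than the total alone.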
Verify the recovery
After recovery is complete, verify that the solution works well by doing the following:
- Verify that BPEL and BPMN instances can be continued successfully.
- Verify that new BPEL and BPMN instances can be started successfully.
- For in-doubt transactions, check the servers' SystemOut.log file for information indicating which transactions were recovered, meaning they were either committed or rolled back successfully.
The recovery timeline
In our testing, we were able to recover the standby data center within 15 minutes, as shown in Figure 7. Of course, this applies only to IBM BPM itself. Actual data center recovery involves not only BPM but other business systems as well. Therefore, recovery times depend upon the details of the installation and will vary from one environment to another.
Figure 7. Recovery standby data center
Standby data center planning
As shown in Figure 1, the standby data center has the same capacity as the primary data center. In this configuration, when a disaster occurs, the standby data center can fully take over the primary data center's workload. Depending on business requirements, the standby data center could contain less total capacity than the primary data center, especially if the standby data center is only intended to take over critical business operations. You need to carefully plan for the standby data center's capacity before implementing this disaster recovery strategy.
Common pitfalls and solutions
This hot/warm disaster recovery strategy presents some technical challenges that are not present when using a traditional active-cold disaster recovery solution. Therefore, you should take special care when implementing this disaster recovery strategy. This section describes some of those pitfalls and their corresponding solutions.
Database server time zones
If the primary and secondary data centers are far apart, they may be located in different time zones. In this case, it may be tempting to configure the two database servers with different time zones. However, this configuration is not supported in IBM BPM V8.0.1. The same time zone must be configured on both database servers. This limitation is documented in the IBM Business Process Manager Information Center.
Messaging engine cluster member start-up
In a three or four cluster deployment environment (the topologies with remote messaging clusters), the messaging engine cluster members must be started before starting the members of the other clusters. If there is only one messaging engine cluster member at the standby site, you may notice that the messaging server can't be started until another server in the same core group is started and that the SystemOut log file repeatedly reports the following information:
Listing 4. Messaging engine server startup
[11/9/12 16:52:24:809 GMT+08:00] 00000007 RLSHAGroupCal W CWRLS0030W: Waiting for HAManager to activate recovery processing for local WebSphere server
This is a known issue in WebSphere Application Server. In order to avoid this issue, ensure that at least two messaging engine cluster members are present in the standby data center. Usually, only one of the messaging engine cluster members will be active at a time.
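To spot this condition without reading the log by hand, you can count how often the warning appears. The following is a minimal Python sketch that parses lines in the format shown in Listing 4 and counts a given message ID. The function names and the threshold of three are hypothetical, and the sample input simply repeats the line from Listing 4.

```python
import re

# Matches lines such as the one in Listing 4:
# [timestamp] threadId component level MESSAGEID: text
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<thread>\S+)\s+(?P<component>\S+)\s+"
    r"(?P<level>\S)\s+(?P<msgid>[A-Z]{4,5}\d{4}[A-Z]):"
)

def count_message(lines, message_id):
    """Count log lines whose message ID equals message_id."""
    count = 0
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("msgid") == message_id:
            count += 1
    return count

def waiting_for_hamanager(lines, threshold=3):
    """Return True if CWRLS0030W appears at least threshold times."""
    return count_message(lines, "CWRLS0030W") >= threshold

listing_4 = ("[11/9/12 16:52:24:809 GMT+08:00] 00000007 RLSHAGroupCal W "
             "CWRLS0030W: Waiting for HAManager to activate recovery "
             "processing for local WebSphere server")
sample_log = [listing_4] * 3

print("occurrences:", count_message(sample_log, "CWRLS0030W"))
print("stuck waiting:", waiting_for_hamanager(sample_log))
```

To check a real server, pass the lines of its SystemOut.log file instead of `sample_log`.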
Cross-cell Enterprise JavaBeans (EJB) look-up failure
When the standby site takes over, name resolution for EJB look-ups may fail. If the application contains an EJB client that looks up BPM EJBs implemented in another WebSphere cell, the look-up may fail, as documented in the IBM Redbook® Techniques for Managing Large WebSphere Installations (page 67). For applications that use this type of EJB interaction (which is not common), the problem can be avoided by creating an additional node agent in the standby data center, assigning it to the same core group as the cluster members and node agents in the primary data center, and creating a core group bridge between the two core groups. To avoid cross-site network traffic during normal operations, this additional node agent is not started until after a disaster is detected. Figure 8 illustrates this solution.
Figure 8. EJB lookup failure solution
To create the core group bridge, complete the following steps:
- In the Core group bridge settings page of the WebSphere admin console, select Access point groups, as shown in Figure 9.
Figure 9. Core group bridge settings
- Select DefaultAccessPointGroup => Core group access points, then select CGAP_1\DefaultCoreGroup, and click Show Details, as shown in Figure 10.
Figure 10. Core group access points
- Under Additional Properties, click Bridge interface and then click New.
Figure 11. Access point
- In the Bridge interfaces drop-down list, select the standalone node agent and then click OK, as shown in Figure 12.
Figure 12. Bridge interface
- Return to the Core group access points dialog (Figure 10), and repeat the same steps for the other core group, making sure to select the standby data center's node agent as the bridge interface.
Figure 13 shows the final result after saving and restarting the whole environment.
Figure 13. Access point groups
During normal processing, the standalone node agent is stopped. When recovering the standby data center, this node agent must be started before starting other cluster members.
- Process Center:
- In the context of an overall solution topology, the concepts described in this article apply to Process Server cells and should be applied to Process Center cells as well. Remember that in order to maintain or change process applications that are under the control of Process Center, Process Center also needs to be recovered and made operational at the recovery site.
- Deployment manager:
- Please note that there is no deployment manager on the standby site; this is different from the typical disaster recovery scenarios that have been described for IBM BPM in the past. In order to accommodate this, either file-based replication or a recovery node technique can be used to make the deployment manager available on the standby site. The article Run time management high availability options, redux has great details on different approaches to making the deployment manager highly available. Profile backup and restore can be applied as well, but compared with the file replication or recovery node approaches, it will take more time to recover the deployment manager at the standby site. In general, depending on your IT infrastructure and your business requirements for the deployment manager's recovery time, you can choose the approach that best suits your environment.
- Variations on the strategy:
- There is another variation that involves applying the stray node concept while still leveraging SAN replication. While this is possible, it is not the subject of this article. You can contact the authors directly if you are interested in this combination. Our general point of view is that while this should work, it will not provide as great an improvement on RTO because of the number and length of the recovery procedures required as compared to the database-based approach to data replication.
This article described a useful additional topology for enterprise solutions that require reduced recovery times (RTO) for disaster recovery. We've described the concept of a "stray node" and how properly setting up and configuring the primary and the recovery site can allow this stray node concept to enable recovery faster than previously documented methods.
- Asynchronous replication of WebSphere Process Server and WebSphere Enterprise Service Bus for disaster recovery environments
- Using Oracle Database File System for disaster recovery on IBM Business Process Manager
- Everything you always wanted to know about WebSphere Application Server but were afraid to ask
- Oracle Single Client Access Name (SCAN) (PDF)
- WebSphere Application Server Information Center
- Core group transports in the WebSphere Application Server Information Center
- Apache JMeter
- IBM Business Process Manager Information Center
- Technote: CWRLS0030W message continuously logged and WebSphere Application Server fails to open for e-business
- Techniques for Managing Large WebSphere Installations
- Make a quicker and easier recovery from an unplanned deployment manager failover
- Run time management high availability options, redux
- Configure multiple database hosts into connection string
- Database time zone considerations