Approaching continuous availability in WebSphere Process Server V7

Maintaining availability during application updates and fix pack installations

This article provides background, insights, and a pragmatic set of techniques for installing application updates and product fix packs in WebSphere® Process Server V7.0 environments where continuous availability is desired. This content is part of the IBM Business Process Management Journal.


Jacob (Jake) Stoeffler (stoefflerj@gmail.com), Software Engineer Intern, IBM China

Jacob Stoeffler is a student at the University of Wisconsin-Madison, studying computer science. He spent part of 2012 as an intern at IBM in Rochester, MN, working in the BPM and ODM CTO office.



Eric Herness (herness@us.ibm.com), Distinguished Engineer, IBM

Eric Herness is an IBM Distinguished Engineer and is the Chief Architect for business process management (BPM) in IBM Software Group. Eric is also the CTO for the business unit focused on BPM and operational decision management (ODM), where he leads the architects who define product and technical direction for the business.

Eric has worked with many large customers as they have adopted BPM and ODM approaches. He has had key lead architectural roles in WebSphere for more than 15 years. Eric has an MBA from the Carlson School at the University of Minnesota.



22 May 2013


Introduction

Business process management applications continue to become more and more mission critical as organizations gain experience with how to best leverage processes to help run and improve their business. This means that availability requirements for these applications often approach 24/7. BPM also encourages continuous process improvement, which means changes to applications are coming more rapidly and in a variety of forms. Mechanisms and techniques to minimize downtime and approach continuous availability are needed. These techniques need to be as automated as possible, allowing for consistency, speed, and a reduced chance of human error.

This article provides background, insights, and a pragmatic set of techniques for approaching continuous availability in WebSphere Process Server V7.0.

First, topology for WebSphere Process Server is reviewed. Special considerations that prepare and set the stage for applying the approach suggested by this article are highlighted and detailed. Then application background is presented. Because there are many different kinds of artifacts involved in an IBM BPM solution running on WebSphere Process Server, different techniques are applied to achieve the desired results. There are some constraints and special settings required at the solution and application level as well. A detailed explanation is provided for each scenario, including the steps involved in executing these scenarios. This mainline description of the scenarios executed and the basic techniques applied are used as a baseline for exploring variations related to store-and-forward and alternative topology configurations. Additional insights into the scenarios executed and some implications are also enumerated. Finally, a summary of how to apply a similar approach to product fix packs is described.

Before attempting any of the procedures outlined here in a production environment, we highly recommend that you first test them thoroughly. In addition to the topology, applications must be created in a manner that enables the pursuit of continuous availability. This article provides some specific information, but assumes that the application handles interface changes, required tables in a database, and other resource requirements appropriately. Another way to describe this is to recognize that some application changes may break compatibility and thus require downtime. The focus of this article is on compatible changes. For example, operations are added, not removed or changed. Application-specific databases are not changed. Removing operations and changing databases are the kinds of changes that are outside the scope of this article.


Background of Process Server topology

Figure 1 shows what a typical WebSphere Process Server configuration may look like. For the sake of simplicity, throughout this article we will assume a single-cell topology similar to this one. We cannot possibly cover each and every configuration scenario, but we hope to provide a general set of practices that can be applied to the vast majority of configurations.

Figure 1. A two-node, four-cluster topology with IBM HTTP Server

Any WebSphere Process Server topology utilizes a set of four key functions: application deployment target, messaging infrastructure, supporting infrastructure, and web infrastructure. A cluster is a set of servers that performs one or more of those four functions. It is possible to have one cluster perform all four of the key functions. However, the chances of attaining continuous availability increase if you separate these functions across multiple clusters for greater resiliency.

In the topology shown in Figure 1, there is a separate cluster dedicated to each of the four functions. The application deployment target cluster (AppTarget) consists of the servers on which your applications are installed. These applications may include business processes, services, human tasks, and mediations. The remote messaging cluster (Messaging) provides support for asynchronous messaging for your applications and for the needs of internal Process Server components. The support infrastructure cluster (Support) provides function that is complementary to the deployment target, and as a separate cluster it provides isolation and relieves specific workload from the deployment target. Finally, the web component cluster (WebApp) hosts web-based client applications such as Business Space, Business Process Choreographer tools, and REST API services.

This topology consists of a deployment manager and two Process Server nodes. The role of the deployment manager is to manage the cell and to provide an interface for configuring the various components in the cell. The deployment manager is also responsible for installing and updating applications, and therefore it is of vital importance when we strive to achieve continuous availability.

A node hosts one or more application servers; in this topology, each node hosts one member of each of the four clusters. Each node is remotely managed by the deployment manager via a node agent. The deployment manager communicates with a node's node agent, which in turn communicates with that node's application servers.

In WebSphere Process Server, applications are installed via the deployment manager using either the Integrated Solutions Console or the wsadmin command-line scripting tool. When an application is installed or updated, the new application's artifacts are first stored in the cell's master repository. The deployment manager then passes the new application's artifacts to each node's node agent. Then each node agent updates the application on the node's application servers.

IBM HTTP Server is commonly used for handling HTTP traffic in IBM BPM environments. IBM HTTP Server allows you to configure traffic routing to application servers using a file named plugin-cfg.xml. A client connects to the HTTP server, which routes the client's requests to one of the cell's application servers, depending on the load that each server is configured to handle. We will cover IBM HTTP Server configuration in more detail when we discuss updating asynchronous applications.

In this basic overview of WebSphere Process Server topology, we have attempted to review at a high level what you will need to know in order to understand and successfully use the procedures covered later. For a more in-depth look at Process Server topologies, including installation and configuration steps, please refer to the IBM Redbook WebSphere Business Process Management V7 Production Topologies.

The role of topology in continuous availability scenarios

A key to approaching continuous availability in WebSphere Process Server is minimizing downtime during application updates. Our goal is to update an application without the clients experiencing any downtime and preferably without noticing anything at all. For some types of applications this is simple and requires no special preparation. For most Process Server applications, however, a zero-downtime update requires planning and a special procedure that involves routing incoming application traffic to and away from individual nodes. We will cover this procedure in detail later, but first we will outline the Process Server topology that makes it possible.

A prerequisite to continuous availability is what we call high availability. To approach continuous availability, we recommend a topology similar to the one shown in Figure 1: a two-node, four-cluster topology with one member of each cluster on each node. At a minimum, you must have an application target cluster that is separate from the messaging cluster. This ensures that the messaging engine can fail over independently of the application target servers. You must also have at least two separate nodes that host members of the application target and messaging clusters. This allows you to stop the servers on one node and update an application while the other node handles all of the application traffic. These recommendations provide not only the foundation for approaching continuous availability, but also serve as a solid base for high availability.


Background of Process Server applications

WebSphere Process Server v7.0 applications are usually organized into modular units that are developed and deployed as Service Component Architecture (SCA) modules. A module can contain a variety of SCA components, the basic building blocks that encapsulate business logic and expose it to other components through interfaces. Commonly used component types in WebSphere Process Server include Business Process Execution Language (BPEL) processes, mediation flow components (MFCs), and service components.

WebSphere Process Server application modules are typically developed in WebSphere Integration Developer. Once a module is developed, it is exported as an enterprise archive (EAR) file. The EAR is usually deployed first to a WebSphere Process Server test environment where it is tested both in isolation and with the application as a whole. Once the module has passed thorough testing, it is deployed to a production environment. Although the main focus of this article is on this deployment phase, there are some concepts at the application development level that must be understood in order to successfully approach continuous availability.

SCA invocation styles

SCA components can be invoked either synchronously or asynchronously. Synchronous invocation means that the caller is blocked until a response is received from the component being called. Since the service requester and the service provider run in the same thread, all processing by the requester is suspended until it receives a response from the provider.

Figure 2. Synchronous SCA invocation

Synchronous SCA invocation is useful if the requester is dependent upon receiving a response from the provider in order to continue processing.

Asynchronous invocation allows the caller to invoke a service without waiting for a response to be produced right away. The service requester and the service provider run in different threads, so the requester can continue processing while the provider prepares a response.

Figure 3. Asynchronous SCA invocation

Asynchronous invocation is useful if (1) the provider may take a long time to respond (minutes, hours, or days) or (2) the requester has further processing it can perform that does not depend on the information returned from the provider. There are three flavors of SCA asynchronous invocation that can be used in WebSphere Process Server applications: one-way invocation, callback invocation, and deferred response invocation. For a review of these asynchronous invocation styles and how they apply to WebSphere Process Server, see the developerWorks article Asynchronous Processing in WebSphere Process Server.

Early binding and late binding for BPEL process invocation

A special case to consider when discussing invocation is the invocation of BPEL business processes. A client (caller) of a BPEL process can be configured to use either early or late binding for invocation. Early binding means that the client is hard-wired to a specific version of the process, and it will only invoke that version. When updating a process that is invoked with early binding, you must also update the client to use the new process version. With late binding, a client will always invoke the process version that is most current. The decision about which process version the client invokes is made dynamically at runtime.

Early and late binding can be configured in several ways depending on how the client invokes the BPEL process. One way that BPEL processes are invoked is via the Business Process Choreographer API. In this case, controlling the type of invocation binding is as simple as calling the proper invocation method. In the Business Process Choreographer API, invocation methods generally have two versions with different signatures: one that takes a process template name and one that takes a template ID. The methods that take a template name as a parameter use late binding while methods that take a template ID use early binding.

BPEL processes can also be invoked via an SCA wire from a calling component to the process component. By default, invocation of a BPEL process from an SCA component is early bound because the SCA wire is tied to a specific version of a process component. It is possible to achieve late binding via SCA by using a "proxy process." Details on this technique can be found in Creating versions of your process to be used with SCA components and exports in the WebSphere Integration Developer Information Center. This is generally not recommended because it creates unnecessary process instances, which can affect performance.

A third case is the invocation of a BPEL process from another BPEL process. This is done by adding an invoke activity as part of the calling process. To make the invocation early bound, use an SCA wire to connect the two process components. To use late binding, do not use a static SCA wire; instead, specify the template name of the target process as part of the reference partner properties of the invoke activity. For more information, see Late binding using a partner link extension in the WebSphere Integration Developer Information Center.


Updating a BPEL business process

This section applies if you only need to update a BPEL business process within a Process Server application. If changes to other types of components within SCA modules are required, please refer to the next two sections regarding updating SCA modules.

Thanks to BPEL versioning, it is quite simple to update a BPEL process while maintaining continuous availability of your application. The only requirements are that (1) the caller of the process is configured to use late binding and (2) the old and new versions of the process have matching component names and target namespaces. These requirements are considered normal and best practice, so there is no significant limitation or inconvenience here. If these requirements are fulfilled, the new process version will be seamlessly picked up the moment that it is set to become valid.

When updating a BPEL process you also need to consider whether or not already running process instances should be migrated to the new process version. Although this generally doesn't affect availability, we would suggest doing instance migration incrementally so as not to overload your system.

If you decide to migrate running instances, follow the guide Create a new version of your process – migrate running instances to create your new process version. Otherwise, follow the steps in Create a new version of your process – running instances use the old version. We recommend setting the "valid-from" time such that the new process version can be installed before it becomes valid.

Once the new process version is created, you can deploy the uniquely named SCA module just as you would any other module, using either the Integrated Solutions Console or wsadmin scripting. Be sure to install the new process module as a new application rather than as an update to an existing application. If you follow these steps, your new process version will be picked up seamlessly the moment it becomes valid. The old process module can be safely removed once you are sure that no other clients are early bound to that process version, there are no running instances, and there are no leftover failed events or messages.
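
For example, a new version deployed as a separate, uniquely named application could be installed from wsadmin (Jython) roughly as follows. This is a minimal sketch: the application name and EAR path are placeholders, not names from this article.

# Install the new process version as a NEW application (not an update of the old one).
AdminApp.install('/tmp/MyProcessModule_v2.ear', '[-appname MyProcessModule_v2]')
AdminConfig.save()
# Synchronize the nodes afterwards so the new application reaches the servers.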


Updating an SCA module in a synchronous Process Server application

It is also rather straightforward to update an SCA module within a strictly synchronous application while maintaining continuous availability. By "strictly synchronous," we mean that all invocations within the application are synchronous. The majority of Process Server applications do not fit into this category. For applications that contain any asynchronous invocations, we recommend that you use the procedure described in the next section of this article.

Our tests of synchronous application updates indicate that continuous availability can be achieved during an in-place update without the use of any complicated procedures or special configurations. We highly recommend, however, testing this with your own application in your own test environment before attempting it in a production environment.

The steps for this in-place module update are nearly the same as for any other Process Server application update. First, make the modifications to your SCA module as you normally would. You can then deploy your updated module using either the Integrated Solutions Console or wsadmin scripting.

You may be accustomed to stopping your applications or your servers before applying updates, but you will not do so here. To deploy the updated module, simply use WebSphere's Update feature, then save and synchronize the changes with nodes. This installs the updated module to your servers simultaneously without any downtime. We do not recommend using the Rollout Update feature in this case. Rollout Update automatically pauses or stops each application server to apply the update sequentially, but we do not want to pause or stop any servers in this case because that would result in a period of unavailability.
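
For reference, the following wsadmin (Jython) sketch performs such an in-place update; the same actions can be done through the Integrated Solutions Console. The application name MyModuleApp, the EAR path, and the node names are placeholders for your own environment.

# Update the existing application in the master repository.
AdminApp.update('MyModuleApp', 'app', '[-operation update -contents /tmp/MyModuleApp_v2.ear]')
AdminConfig.save()
# Synchronize the change to every node that hosts the application.
for nodeName in ['NodeA', 'NodeB']:
    nodeSync = AdminControl.completeObjectName('type=NodeSync,node=%s,*' % nodeName)
    AdminControl.invoke(nodeSync, 'sync')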

As long as all invocations in your application are synchronous, there is very low risk of failure using this method. If you do encounter difficulties while testing this, please refer to the next section for a technique that can be safely used with all application types.


Updating an SCA module in a Process Server application that uses asynchronous or mixed invocation styles

Approaching continuous availability becomes slightly more complicated when you need to update an SCA module in an application that uses asynchronous invocation. The procedure we will discuss in this section pertains to all Process Server applications that contain asynchronous invocations or a mixture of synchronous and asynchronous invocations. This includes applications that use any of the three flavors of asynchronous invocation mentioned earlier in this article.

The reason that approaching continuous availability is more complicated in this case is that asynchronous invocation requires messaging support. Multiple transactions are inherent in any asynchronous model. Queues are also part of the picture. These elements of the architecture combine to make this a more complex scenario than just straight synchronous invocations. During an update, the destination queues of an asynchronously invoked SCA module may be marked for deletion, and any messages contained in those queues could be lost. This means that a direct in-place update is risky if the application is actively processing work. Therefore, for updates of this type we must be able to control the flow of traffic and update the application on a particular node only after work has ceased. Our approach will be to perform the application update one node at a time to maintain availability at all times.

First we will give some background on various WebSphere and WebSphere Process Server concepts that apply specifically to this update procedure. Make sure that you fully understand these concepts before attempting the procedure. Since it will be to your advantage to automate the procedure to avoid inconsistencies and minimize the time required, we will provide sample scripts along the way. We will then detail the update procedure itself at the end of this section.

Node synchronization

We can control when a node will receive a new version of an application by disabling or enabling node synchronization. If we disable node synchronization on a node, the node will not be aware of changes in the master repository of application artifacts, and thus no applications on that node will be updated. When we want an application update to occur, we re-enable node synchronization and when synchronization is triggered the new artifacts are pulled down to the node's servers.

We have found that it is necessary to restart the node agent after disabling or enabling node synchronization in order for the change to take effect; thus we also provide sample scripts for disabling, enabling, and restarting a node agent in the Download section of this article.
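
As an illustration, the following wsadmin (Jython) sketch disables automatic synchronization for a hypothetical node named NodeA and then restarts its node agent. The NodeAgent restart parameters shown (sync first, restart servers) are our assumption; verify the operation signature with Help.operations in your own environment before relying on it.

nodeName = 'NodeA'
# Locate the node agent's configuration synchronization service and disable it.
nodeAgentCfg = AdminConfig.getid('/Node:%s/Server:nodeagent/' % nodeName)
syncSvc = AdminConfig.list('ConfigSynchronizationService', nodeAgentCfg)
AdminConfig.modify(syncSvc, [['autoSynchEnabled', 'false']])
AdminConfig.save()
# Push this one change down to the node, then restart the node agent so it takes effect.
nodeSync = AdminControl.completeObjectName('type=NodeSync,node=%s,*' % nodeName)
AdminControl.invoke(nodeSync, 'sync')
nodeAgent = AdminControl.completeObjectName('type=NodeAgent,node=%s,*' % nodeName)
AdminControl.invoke(nodeAgent, 'restart', 'false false')

To re-enable synchronization later, set autoSynchEnabled back to 'true' and restart the node agent again.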

Routing incoming HTTP traffic

Inbound traffic to WebSphere Process Server can come in a variety of forms, such as HTTP, JMS, and MQ traffic. In this section, we will discuss HTTP traffic. We need the ability to route inbound HTTP traffic to servers on certain nodes and temporarily prevent traffic from reaching servers on other nodes. We will demonstrate how to do this using IBM HTTP Server because of its simplicity and widespread use in BPM environments.

IBM HTTP Server sprays web requests among application servers according to its HTTP plugin configuration file, plugin-cfg.xml. For a detailed explanation of how this works, please refer to the technote Understanding IBM HTTP Server plug-in Load Balancing in a clustered environment.

Of importance to us is the LoadBalanceWeight attribute of the <Server> element. If you set a server's LoadBalanceWeight to 0, the server will no longer receive requests from new sessions. Affinity requests from existing sessions may continue to be routed to the server if session replication is not configured. However, all requests from new sessions will be routed to servers with a LoadBalanceWeight greater than 0.

One way to achieve this is to have multiple plugin-cfg.xml files readily available on your IBM HTTP Server and swap them out when necessary. For example, let's assume we have two application servers that requests are normally routed to: A.App and B.App. In this case, we would need three XML configuration files: one file that allows requests to both servers (both.xml), one file that routes requests only to A.App (a.xml), and one file that routes requests only to B.App (b.xml).
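
For illustration, a fragment of a.xml might look like the following. The cluster, server, host, and port values are invented for this sketch, and only the elements relevant to weighting are shown; the plugin-cfg.xml generated for your cell will contain additional attributes.

<ServerCluster Name="AppTargetCluster" LoadBalance="Round Robin" RetryInterval="60">
  <!-- Requests from new sessions go only to A.App... -->
  <Server Name="A.App" LoadBalanceWeight="2">
    <Transport Hostname="nodea.example.com" Port="9080" Protocol="http"/>
  </Server>
  <!-- ...because B.App's weight is 0. -->
  <Server Name="B.App" LoadBalanceWeight="0">
    <Transport Hostname="nodeb.example.com" Port="9080" Protocol="http"/>
  </Server>
  <PrimaryServers>
    <Server Name="A.App"/>
    <Server Name="B.App"/>
  </PrimaryServers>
</ServerCluster>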

In normal operation mode, the IBM HTTP Server plugin-cfg.xml would contain the contents of both.xml. When the time comes to perform an application update on one of our servers, we would simply swap out the contents of plugin-cfg.xml with one of our other configuration files. For example, if we want all requests to be routed to A.App, we replace plugin-cfg.xml with a.xml. IBM HTTP Server seamlessly picks up the configuration change and stops routing requests to B.App. Normally there is a delay before IHS detects configuration changes, but we can use the following command to gracefully reload the configuration immediately:
IBM/HTTPServer/bin/apachectl -k graceful
(See ihs_route_to_node_a.sh provided for download.)

Routing other traffic

After inbound HTTP traffic has been rerouted, a cluster member can still receive new work via JMS or MQ traffic. A cluster member cannot shut down until it has finished processing work; thus we also need a way to stop new JMS and MQ traffic from flowing in. We can do this by deactivating a cluster member's J2CMessageEndpoints. We demonstrate this in the script quiesce_traffic_jms_mq.jacl provided for download.
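
A hedged wsadmin (Jython) sketch of this step is shown below for a hypothetical NodeA. We assume the J2CMessageEndpoint MBeans expose pause and resume operations (this is how the console pauses endpoints); confirm the operation names with Help.operations before scripting against them.

nodeName = 'NodeA'
# Pause every message endpoint on the node so no new JMS or MQ work is delivered.
endpoints = AdminControl.queryNames('type=J2CMessageEndpoint,node=%s,*' % nodeName)
for endpoint in endpoints.splitlines():
    if endpoint:
        AdminControl.invoke(endpoint, 'pause')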

If BPEL processes are installed and running, we also need to stop BPEL scheduler generated traffic. This can be accomplished by stopping the BPEScheduler as demonstrated in the script quiesce_traffic_bpel.jacl provided for download.

If you have created any other schedulers, you should also stop them. You can script this as demonstrated with the BPEScheduler script above.

Stopping and starting servers gracefully

Once all incoming traffic has stopped, we can safely stop the cluster members on a node. We recommend stopping a node's cluster members in the following order: application target cluster, web cluster, support cluster, and messaging cluster. This can either be scripted with wsadmin or done through the Integrated Solutions Console, but again we recommend that you script as much as possible for higher consistency. You can use the script stop_cluster_members.jacl provided for download.

After updating an application on a node, we will need to start its cluster members again. We recommend starting cluster members in the reverse order of stopping them; that is, messaging cluster, support cluster, web cluster, and application target cluster. After restarting a cluster member, there is no need for you to reactivate J2CMessageEndpoints or restart the BPEScheduler, as this will all be done automatically at server startup. (See start_cluster_members.jacl provided for download.)
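
As a minimal wsadmin (Jython) sketch of the stop and start sequence, with placeholder server names for the four cluster members on NodeA:

nodeName = 'NodeA'
stopOrder = ['AppTarget_NodeA', 'WebApp_NodeA', 'Support_NodeA', 'Messaging_NodeA']
# Stop the node's cluster members in the recommended order.
for serverName in stopOrder:
    AdminControl.stopServer(serverName, nodeName)
# ...update the application on the node...
# Start the cluster members again in the reverse order.
startOrder = stopOrder[:]
startOrder.reverse()
for serverName in startOrder:
    AdminControl.startServer(serverName, nodeName)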

Failed events

During our testing of this procedure in WebSphere Process Server V7.0.0.3, we occasionally saw SCA failed events occur during messaging engine failover. It is our experience, however, that the failed events can be resubmitted successfully after the update is complete. The failed event manager is accessible in the Integrated Solutions Console, but it is also possible to resubmit failed events using wsadmin scripting. (See resubmit_failed_events.jython provided for download.)
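
If you prefer to script this yourself rather than use the provided script, a cautious first step is simply to locate the failed event manager MBean and list its operations, since the exact resubmission operation names can vary by version. The MBean type name used below is our assumption; adjust it if the query returns nothing.

# Locate the failed event manager MBean and print the operations it exposes.
femList = AdminControl.queryNames('WebSphere:type=FailedEventManager,*').splitlines()
if femList and femList[0]:
    print Help.operations(femList[0])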

Store-and-forward

If you want to minimize the generation of failed events, you can use the store-and-forward feature that is new in WebSphere Process Server V7.0. If this feature is used, only one failed event is generated if there are runtime errors. Once one runtime error occurs, a store is triggered, and all of a service's subsequent requests are stored in a queue rather than being submitted. You can later forward these requests to their destinations using the Store and Forward widget in Business Space. There is currently no public store-and-forward API that supports changing store/forward state, so we do not advocate scripting this step, though we have verified that it is possible to do so. For more details on how to use store-and-forward, please see the developerWorks tutorial Using the store-and-forward feature in WebSphere Process Server v7.0.

Application-level settings

While testing this procedure in WebSphere Process Server V7.0.0.3, we identified a few application-level settings that were most successful for asynchronous applications with certain attributes. Most failures we encountered occurred during failover of the messaging engine between cluster members. For the update procedure to be successful, it is important that your application gracefully handles messaging engine failover. Following the recommendations we give here will be a good start, although, given the variability of IBM BPM applications and environments, it is possible that your scenario will require slightly different settings. Therefore, we specifically recommend that you test your application while under load during messaging engine failover to ensure it is handled properly and important messages are not lost.

First, if your module contains any mediation flow components, ensure that the fail terminals are connected to a fail error handling component. This ensures that failed events are saved and that failed transactions are rolled back when necessary. If your module uses store-and-forward, we recommend setting the store-and-forward qualifier such that all ServiceRuntimeExceptions are caught. There are many types of runtime exceptions that could occur during the update procedure, and you want to be sure to catch all of them. If your module contains any one-way asynchronous invocations, we recommend setting the reference's asynchronous invocation qualifier to call rather than commit (call is the default setting). Lastly, we recommend setting the reference's asynchronous reliability qualifier to assured (persistent) for any type of asynchronous invocation. These last two recommendations help to ensure that no messages are lost in the case of failures during the update procedure.

We highly recommend these settings if your asynchronous module has any of the attributes described above. Again, we urge you to test thoroughly because what is necessary for one application and environment may be different for another.

Update procedure for SCA modules within asynchronous applications

Now that we have discussed the important elements of the update procedure individually, we will detail the procedure itself. Follow these steps to approach continuous availability during module updates for applications containing asynchronous or mixed-style invocations. After each step is a figure that reflects the state of the cell's components after the step is completed.

  1. Before starting the update, the Process Server cell is in its normal operational state.

    All JVMs are running, including the deployment manager, all cluster members, and all node agents. Automatic node synchronization is enabled for both nodes, as represented by the blue lines between the node agents and the deployment manager. The messaging engine (ME) is active on the Messaging cluster member of Node A. HTTP traffic is routed to all members of the AppTarget cluster, as shown by the orange dashed lines. Schedulers such as the BPEScheduler are running. Version v1 of appX is currently deployed to the application target of both nodes and is also seen in the master repository. We will deploy v2 of appX to the application target cluster, one node at a time. The Support and WebApp clusters are not shown here because we are assuming that the application is only installed to the AppTarget cluster. As we move from step to step we will highlight changes in red.

    Figure 4. Beginning state
  2. Disable node synchronization and restart the node agents for all nodes on which the application is installed.

    We disable node synchronization so that the nodes do not immediately receive the new version of the application when it is updated using the deployment manager. We will update the application on each node only once the node's cluster members have been gracefully shut down.

    This is shown in Figure 5 by the absence of the blue lines connecting the node agents and the deployment manager.

    Figure 5. Disable node synchronization for all nodes
  3. Update the application module(s) using the deployment manager. Do not synchronize changes with nodes.

    Install the update using wsadmin scripting or the Integrated Solutions Console. Use the Update feature to update the existing module rather than installing the update as a new module. Save changes to the master repository, but do not synchronize changes with the nodes.

    As a result of this step, v2 of the module is in the master repository but is not in each node's local configuration. Here we demonstrate updating only one module, but multiple modules can be updated at once as long as the changes are backwards compatible (see step 4d).

    Figure 6. Install update to master repository
  4. For each node on which the application is installed, perform the following steps, one node at a time. We will illustrate this with Node A only, but these steps would be repeated for Node B to complete the update.
    1. Stop all incoming traffic to cluster members on the node. This includes HTTP, JMS, MQ, and BPEL traffic. If session replication is not configured and there are still active HTTP sessions tied to cluster members that could contain critical data, you should wait for those sessions to close.

      We stop traffic to cluster members on the node because we need to stop the cluster members gracefully in the next step. The cluster members will not shut down until all work has finished; thus, assuming that there is constant incoming traffic, we must redirect that traffic to allow the servers adequate room to shut down. Earlier in this section we described how to do this for HTTP, JMS, MQ, and BPEL traffic.

      This step is illustrated below by the change in plugin-cfg.xml. As a result of the change, all HTTP requests are routed to the cluster member on Node B.

      Figure 7. Stop incoming traffic to cluster members on the node
    2. Stop cluster members that the application utilizes on the node.

      Be sure to stop the cluster members in the order given earlier in this section: application target, web, support, and messaging. For example, if your application is deployed to the application target cluster, first stop the application target cluster member. Then, since this is an application that uses asynchronous invocation, stop the messaging cluster member. If the messaging engine is currently active on a cluster member on the node, it will fail over to a different cluster member. This could take a few seconds or a few minutes, depending on your environment, but you need to wait for the failover to finish before continuing.

      We stop the cluster members because we do not want to process in-flight work while the module is being updated on the node.

      The darkened boxes in Figure 8 show the cluster members that are stopped. Note that the messaging engine is now active on the messaging cluster member of Node B.

      Figure 8. Stop cluster members on the node
    3. Enable node synchronization and restart the node agent. Trigger synchronization if the node is not configured to synchronize at startup. Wait for node synchronization to complete before moving to the next step.

      As a result of node synchronization, the node agent receives the changes from the master configuration and updates the node's local configuration.

      This step is shown in Figure 9 by the red line connecting the node agent to the deployment manager. v2 of the module is now present on the node's application target cluster member.

      Figure 9. Enable node synchronization for the node
    4. Start all cluster members that you stopped in step 4b.

      Now that the node contains the updated application, we can start its cluster members. Remember to do this in the correct order: messaging, support, web, and application target.

      When the cluster members are started, the MDBs and schedulers are automatically started. Since we will re-enable HTTP traffic in the next step, there is no need to manually restart anything at this point other than the cluster members themselves.

      As shown in Figure 10, there may be a small period of time in which the nodes are simultaneously running two different versions of the application. In this example, v1 and v2 of Module_4 are simultaneously putting messages in the destination queue for Module_5. As long as the changes are compatible as we described earlier, this will not be a problem. For example, if Module_5 needs to be updated in order to be compatible with Module_4 v2, its new version must also be compatible with Module_4 v1.

      Figure 10. Mixed module versions during transition period

      To make this time period as short as possible, you should stop the cluster members on the next node as soon as this node's cluster members have started. If you script this, the transition can be made in a matter of seconds, assuming you only have two nodes. If the application needs to be updated on more than two nodes, you have the option of waiting to start nodes running the new version until all nodes running the old version are stopped. In any case you should test running different versions simultaneously before attempting this. If you find any compatibility issues, you should not use this update procedure.

      This step is depicted in Figure 11 by the green cluster members, which signify that they are now running again. Note that the messaging engine is still active on Node B's messaging cluster member. If the cluster member on Node A were configured to be the preferred server, the messaging engine would become active on Node A at this point.

      Figure 11. Start cluster members on the node
    5. If you disabled incoming HTTP traffic in step 4a, enable incoming HTTP traffic to cluster members on the node.

      If there is another node that needs to be updated, you can combine this step with step 4a for the next node by properly crafting plugin-cfg.xml. This would help to minimize the time period in which different versions of the module are running at the same time.

      Figure 12. Enable incoming HTTP traffic
    6. Repeat steps a through e for any remaining nodes on which the application has not been updated.
  5. Resubmit any failed events that may have occurred during the procedure. If your application utilizes store-and-forward, forward any events that may have been stored.
  6. Update is complete!
    Figure 13. End state

Variation: Process Server fix pack installation

By extending the application update procedure described in the previous section, we can also approach continuous availability during WebSphere Process Server fix pack installation. The general approach remains the same: apply the update to one node at a time, routing traffic away from the node while the fix pack is being installed.

Note that in the case of a product upgrade, the health of the whole cell is at risk rather than the health of only one application. Therefore, we strongly advise that you not attempt this procedure in a production environment before testing it thoroughly in a comparable test environment.

Also note that we have only tested this procedure when upgrading from V7.0.0.4 to V7.0.0.5, so we cannot guarantee that the same approach will be successful in other cases. One assumption we make is that no database updates are required for the fix pack. This is because we will have a "mixed cell" at some point during this procedure; that is, clusters will have members running different Process Server versions at the same time. This could cause severe problems if there is a database incompatibility.

We will give you an outline of the procedure we used to achieve continuous availability during an upgrade from V7.0.0.4 to V7.0.0.5. First read Special Instructions for WebSphere Process Server and WebSphere Enterprise Service Bus V7.0.0 Fix Pack 5 (V7.0.0.5) for official instructions to upgrade with minimum downtime. The method we present here will be a modification of those instructions, essentially accomplishing the same end result (an upgraded cell), but in a way that allows for near-continuous availability. For each step, please refer to the corresponding step in the official instructions for more details.

Process Server fix pack installation procedure

  1. Before starting the upgrade, the Process Server cell is in its normal operational state.

    To demonstrate this procedure we will use the deployment environment shown in Figure 14, which consists of a deployment manager node and two custom nodes, both running WebSphere Process Server V7.0.0.4. Because the nodes are hosted on separate machines, each can be upgraded individually. At this point, all JVMs are running, including the deployment manager, all cluster members, and all node agents. Automatic node synchronization is enabled for both nodes, as represented by the blue lines between the node agents and the deployment manager. The messaging engine (ME) is active on the Messaging cluster member of Node A. HTTP traffic is routed to all members of the AppTarget cluster, as shown by the orange dashed lines. For brevity we are only showing the application target and messaging clusters, but in reality there may be more clusters. We will show this diagram periodically as we move through the procedure, highlighting changes in red.

    Figure 14. Beginning state
  2. Disable node synchronization and restart the node agents of all nodes.
  3. Stop the deployment manager.
  4. Install the fix pack to the deployment manager's installation root.
  5. Start the deployment manager.
    Figure 15. Fix pack installed to deployment manager
  6. For each node you wish to upgrade, complete the following steps, one node at a time:
    1. Stop all incoming traffic to cluster members on the node. This includes HTTP, JMS, MQ, and BPEL traffic. If there are still active HTTP sessions tied to cluster members, wait for them to close.
    2. Stop all cluster members on the node.
    3. Stop the node agent of the node.
      Figure 16. Cluster members and node agents stopped on the node
    4. Install the fix pack to the node's installation root.
      Figure 17. Fix pack installed to node
    5. From the deployment manager, run the profile upgrade script for each cluster that has members on this node. This script should only be run once per cluster per cell, so if all clusters have already been upgraded, skip this step.
    6. If Business Space is configured and the templates/spaces are hosted on a cluster member on this node, perform steps 1a and 1b of Updating Business Space templates and spaces after installing or updating widgets. That article will also help you determine which cluster member the templates and spaces are hosted on. We recommend that the first node you upgrade is the one that hosts the templates and spaces.
    7. Enable node synchronization and restart the node agent. Trigger synchronization if the node is not configured to synchronize at startup. Wait for node synchronization to complete before moving to the next step.
    8. Start all cluster members that you stopped in step 6b.
    9. If you disabled incoming HTTP traffic in step 6a, enable incoming HTTP traffic to cluster members on the node.
      Figure 18. Cluster members started, node synchronization enabled for the node
    10. Repeat steps a through i for any remaining nodes to which the fix pack has not been applied.
  7. Resubmit any failed events that may have occurred during the procedure. If your module utilizes store-and-forward, forward any events that may have been stored.
  8. Fix pack installation is complete!
Figure 19. End state

Conclusion

By carefully planning topology and properly crafting scripts that update business process solution components, you can greatly improve availability. This enables you to leverage BPM applications in environments where near continuous availability is required. In addition to the topology, we have outlined a set of application design guidelines and given procedures for updating various components of Process Server applications. While the techniques applied to approach continuous availability will need to be altered to fit the specific installation and configurations that might exist in a particular organization, the overall idea should be clear and well understood.

In the future, expect to see additional information on how we approach a similar level of availability using IBM BPM V8 or later. When using IBM BPM V8 in an EAR-based "Process Server only" mode, the approach and details provided in this article should apply equally well. In that environment, however, you will have new challenges and new opportunities given the presence of the Process Center and the addition of different types of authored artifacts contained in process applications and toolkits.


Acknowledgements

The authors would like to thank Karri Carlson-Neumann for her reviews and suggestions for this article.


Download

Description                           Name          Size
Scripts for use with this article     scripts.zip   4KB
