Building a highly available WebSphere Business Events event infrastructure

In this article you'll learn how you can develop a highly available architecture using WebSphere Application Server with WebSphere Business Events and WebSphere MQ.

Introduction

IBM WebSphere Business Events (hereafter called Business Events) helps you detect, evaluate, and respond to the impact of events based on the discovery of actionable event patterns. It enables you to define and manage events, so that you can take timely actions. This results in reducing the total cost of ownership through codeless implementation. To ensure that a Business Events deployment is highly available, your implementation should leverage the existing high availability features of WebSphere Application Server Network Deployment (hereafter called Application Server ND).

Currently, additional configuration is required to leverage this Application Server functionality in WebSphere Business Events. In this article, you'll learn how you can develop a highly available architecture using the inherent Application Server ND functionality along with specific application settings in WebSphere Business Events and WebSphere MQ (hereafter called MQ), to provide high availability.


What is high availability?

The term "high availability" has many definitions, but it most often refers to a system that is available most of the time, that is, one with minimal downtime, including the scheduled downtime required for upgrades or maintenance. Availability is measured as uptime divided by total time, expressed as a percentage. A system is generally considered highly available when this number is 99.x percent or greater, where x represents between one and three decimal digits (for example, 99.999 percent availability is called “five nines availability”).

The terms availability and high availability can have very different meanings depending on the audience. These terms often describe a variety of business goals and technical requirements, from hardware-only availability targets to mission-critical targets.

An available system is one that comprises a set of system-wide, shared resources that cooperate to provide essential services. High availability systems combine software with industry-standard hardware to minimize downtime by quickly restoring services when a system, component, or application fails. While not instantaneous, services are restored quickly--often in less than a minute.

A high availability solution guarantees that a system will automatically recover in the event of a software or hardware failure. The goal of achieving high availability is to eliminate scheduled downtime and minimize unscheduled downtime. Often, organizations have inappropriate expectations regarding availability targets, and they might demand higher levels of availability than they are actually willing to pay for. Once organizations understand the total cost of implementation, many revise their requirements. Implementing a high availability solution includes, but is not limited to, the following costs:

  • Hardware
  • Software
  • Network infrastructure
  • Training
  • Serviceability
  • Operations

Targeting high availability

As downtime approaches 0, availability approaches 100%. Downtime includes both planned and unplanned downtime.

Table 1. Downtime averages
Availability | Weekly downtime (hours) | Weekly downtime (minutes) | Yearly average downtime (minutes)
90.000% | 10 | 600 | 28080
99.000% | 1 | 60 | 3089
99.900% | 0.1 | 6 | 526
99.990% | 0.01 | 0.6 | 53
99.999% | 0 | 0.06 | 5
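
The yearly figures for the 99.9% to 99.999% rows can be reproduced directly from the availability percentage. The following is a minimal sketch, assuming a 365-day year (525,600 minutes); the function name `yearly_downtime_minutes` is introduced here for illustration only.

```python
# Assumption: a non-leap year of 365 days, that is, 525,600 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def yearly_downtime_minutes(availability_percent):
    """Return the downtime per year, in minutes, implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% availability allows about {yearly_downtime_minutes(pct):.0f} minutes of downtime per year")
```

Rounded to whole minutes, these values correspond to the yearly downtime shown for the three highest availability levels.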

According to Gartner Research Note, ID Number: AV-13-9472, the causes of downtime and their associated probabilities are as follows:

Table 2. Downtime causes and probability
Causes of downtime | Probability
Software failures (unplanned) | 40%
Hardware failures (unplanned) | 10%
Human errors (unplanned) | 15%
Environmental problems (unplanned) | 5%
Planned downtime | 30%

As seen in Table 2, there are two categories of downtime: unplanned and planned. Environmental problems and hardware failures are the least likely causes, whereas software failures and planned downtime together account for 70% of system downtime.

Based on the above table and a desired availability of 99.99%, the allowed downtime for each cause is shown in Table 3:

Table 3. Allowable downtime for various causes
Causes of downtime | Downtime in a year
Software failures (unplanned) | 21 minutes
Hardware failures (unplanned) | 5 minutes
Human errors (unplanned) | 8 minutes
Environmental problems (unplanned) | 3 minutes
Planned downtime | 16 minutes
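
The allocation in Table 3 follows from applying the probabilities in Table 2 to the 53 minutes of yearly downtime that 99.99% availability allows. The following sketch shows that arithmetic; the names `DOWNTIME_SHARES` and `allowable_downtime` are hypothetical and exist only for this illustration.

```python
# Shares of total downtime by cause, taken from Table 2.
DOWNTIME_SHARES = {
    "Software failures (unplanned)": 0.40,
    "Hardware failures (unplanned)": 0.10,
    "Human errors (unplanned)": 0.15,
    "Environmental problems (unplanned)": 0.05,
    "Planned downtime": 0.30,
}


def allowable_downtime(total_minutes, shares=DOWNTIME_SHARES):
    """Split a yearly downtime budget across causes, rounded to whole minutes."""
    return {cause: round(total_minutes * share) for cause, share in shares.items()}


# 53 minutes per year is the budget for 99.99% availability.
budget = allowable_downtime(53)
for cause, minutes in budget.items():
    print(f"{cause}: {minutes} minutes")
```

The same function can be used with a different total to see how a stricter or looser availability target changes the budget for each cause.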

Minimizing recovery time

You can minimize hardware failure recovery times by doing the following:

  • Addressing hardware redundancy: Hardware redundancy includes items such as redundant routers, servers, disks, and power supplies. Redundant hardware is not limited to only the physical machines that the architecture is running on, but also the associated infrastructure such as power and cooling facilities.
  • Providing automated detection of failures: To provide true hardware recovery, you need some form of automated detection of failures. This can be provided through software such as High Availability Cluster Multi-Processing (HACMP), Application Server ND, and various other high availability solutions depending upon the platform being implemented and the solution's complexity.
  • Using hot swap hardware facilities: Hardware recovery time can be minimized by using hot swap hardware facilities such as disks, network cards, and so on.
  • Providing a stable and secure environment for the physical location of the system: Environmentally-related problems can be minimized using a stable and secure environment for the physical location of the system. For example, if the server is beneath someone's desk, the chance of failure is greater than if it's in a secure location.

You can minimize human errors by doing the following:

  • Providing a clear system management interface: One of the biggest hindrances for high availability related to human errors is the lack of a clear system management interface. If your support personnel have to go to multiple consoles to manage a process or detect an error condition, the instances of human error increase.
  • Providing a clear system operation procedure: The second issue that causes additional downtime in a failure condition is the lack of a clear system operation procedure. This can be as complex as an enterprise-level disaster plan or as simple as a defined process for a particular application. In either case, you need to store these plans in a convenient place for retrieval. These documents can also be referred to as operational procedure documents. You can minimize planned and unplanned downtime through the use of a well-defined and clear system operation procedure.

Other ways to improve availability include:

  • Addressing software redundancy
  • Providing automatic detection of software failure
  • Providing automatic recovery of software services
  • Providing software-specific configuration
  • Addressing specific application design

Understanding the Business Events high availability solution

With Business Events V6.2 and V7.0, the ability to provide a high availability environment has dramatically improved from the initial product releases. With Business Events, IBM provides a completely scalable and available solution by exploiting native Application Server ND functionality. However, the technology connectors don't leverage this functionality, so they will need to rely on another method of providing high availability.

The Service Integration Bus (hereafter SiBus) is a logical entity that is created and configured after installation of WebSphere Application Server through either the administrative console, or through the wsadmin scripting language. You can configure the SiBus in two modes on an Application Server cluster, depending on the requirements of the deployment. Because the SiBus is a logical entity based on the physical implementation of a message-driven bean (MDB), there is no inherent high availability or workload management functionality. You need to independently configure the underlying physical implementation prior to creation.

The rest of this article will show a highly available Business Events infrastructure using WebSphere MQ as the external messaging bus. Again, keep in mind that technology adapters are not, by their nature, built to be highly available. However, they can be run in an active/passive mode to provide a higher level of availability for these feeder applications. Business events, by their nature, are transitory items and are usually not valid after some period of time. Therefore, by default, all events are non-persistent, and any event states that need to survive a failure should be set to be persistent, and defined as durable events, in Business Events.


Overview of cluster configurations

The high availability cluster configuration is the default configuration created when a cluster of application servers in a cell is created. When the SiBus is created, there is only one active messaging engine on one of the cluster servers, and all service requests to cluster members are routed through this single messaging engine. Therefore, for a cluster of n servers, there will be one local message put action for routing the service request on the server with the active messaging engine, and (n-1) remote message put actions for each of the servers with inactive messaging engines.

The advantage of this configuration is that when there is a messaging engine failure, another server becomes active and all remote put actions are routed through the newly active messaging engine. Also, because there is only a single server routing messages to the target service, message sequencing is preserved through the bus.

The disadvantage of this configuration relates to performance, since there is effectively a messaging bottleneck within the cluster configuration, and (n-1)/n of all service requests are remote puts to the active messaging engine.
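
To get a feel for how this bottleneck grows with cluster size, the following sketch counts local and remote put actions for a cluster of n servers. It assumes that service requests are spread evenly across the cluster members, which the article does not state; the function names are hypothetical.

```python
def put_actions(cluster_size):
    """Return (local_puts, remote_puts) when each cluster member routes one request.

    Only the server hosting the active messaging engine performs a local put;
    every other member performs a remote put to that engine.
    """
    if cluster_size < 1:
        raise ValueError("a cluster needs at least one server")
    return 1, cluster_size - 1


def remote_put_fraction(cluster_size):
    """Fraction of requests that are remote puts, assuming an even request spread."""
    local, remote = put_actions(cluster_size)
    return remote / (local + remote)


for n in (2, 4, 8):
    print(f"{n} servers: {remote_put_fraction(n):.0%} of requests are remote puts")
```

Under this assumption, the share of remote puts rises toward 100% as servers are added, which is why the single active messaging engine limits scalability.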


WebSphere MQ high availability and failover

By using MQ along with a high availability service application, such as HACMP, Microsoft™ Cluster Service, or Veritas Cluster Server, you can further enhance the availability of MQ queue managers feeding the Business Events engines along with the technology connectors.

A service set must contain all of the processes and resources needed to deliver a highly available service, and, ideally, should contain only these processes and resources. It is recommended that you use the queue manager and the JMS adapter as the unit of failover for MQ, since Application Server ND will handle the application failover using Application Server high availability. Consequently, for optimal configuration, you should place each queue manager in a separate package, together with the resources that it depends on. The service set should therefore contain the shared disks used by a queue manager, which should be in a volume group reserved exclusively for the package, the IP address used to connect to the queue manager (the package IP address), and an application server, which represents the queue manager.

A queue manager that will be used in a high availability cluster needs to have its logs and data on shared disks, so that they can be accessed by a surviving node in the event of a node failure. A node running a queue manager must also maintain a number of files on internal disks. These files include files that relate to all queue managers on the node, such as /var/mqm/mqs.ini, and queue manager-specific files that are used to generate internal control information. Files related to a queue manager are therefore divided between internal and shared disks. Use a single shared disk for all of the recovery data (logs and data) related to a queue manager.

Figure 1 depicts a typical active/passive configuration. On each of the messaging servers, there is a set of application code that consists of the MQ server, the JMS adapter for Business Events, and the client connection package for Application Server. When a high availability event occurs, this resource group will fail over to the secondary machine and continue processing where it left off.

Figure 1. Typical active/passive configuration

Figure 2 depicts the high-level Business Events solution. It consists of an MQ cluster and a Business Events cluster. The Business Events cluster communicates with the MQ cluster through JMS action and event points or through the action queue and event queue. These queue types are each defined differently, and each has limitations. Figure 2 shows the configuration that can be used for event streams.

Figure 2. Event stream configuration

The default configuration for a Business Events server is to use subscription points. This configuration has a problem inherent in any solution based on topic subscriptions: every subscriber receives its own copy of each event, so events are duplicated across the subscribers. You can address this problem with a single gateway event engine, but this leads to a single point of failure or an additional layer of complexity that might not be needed in all cases.

The second method of connectivity is to use technology connectors, which use parameters that are defined within the tools for the touchpoints. Through the use of point-to-point queues and having the event stream load balanced by MQ from the inbound side, you can ensure that there are no duplicate events being received by Business Events.

The third method is to configure Business Events to use a static queue for its events. The limitation of this method is that there is no segregation of events on the event server. All events come in on this single queue, are processed, and go to a single action queue. This configuration requires an external broker to route the action to its ultimate destination.
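
The difference between topic subscriptions and point-to-point queues described above can be illustrated with a small in-memory model. This is only a sketch: it does not use the MQ or JMS APIs, the node names are made up, and the round-robin distribution is a simplifying assumption standing in for MQ cluster workload balancing.

```python
from itertools import cycle


def deliver_via_topic(events, subscribers):
    """Publish/subscribe: every subscriber receives a copy of every event."""
    return {name: list(events) for name in subscribers}


def deliver_via_queue(events, consumers):
    """Point-to-point: each event goes to exactly one consumer (round-robin here)."""
    received = {name: [] for name in consumers}
    for event, name in zip(events, cycle(consumers)):
        received[name].append(event)
    return received


def total_deliveries(received):
    """Count how many event deliveries happened across all nodes."""
    return sum(len(items) for items in received.values())


events = ["order-1", "order-2", "order-3", "order-4"]
nodes = ["wbe-node-1", "wbe-node-2"]  # hypothetical Business Events nodes

print("topic deliveries:", total_deliveries(deliver_via_topic(events, nodes)))
print("queue deliveries:", total_deliveries(deliver_via_queue(events, nodes)))
```

With two nodes, the topic model delivers each event twice, while the queue model delivers each event once, which is the property that prevents duplicate events from reaching Business Events.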

At this point, you should be aware that any in-flight non-persistent and non-durable events will be lost in the event of a failure. In the case of contextual data, this data should be persisted to the steps table and be available to the new Business Events node when it fails over. This configuration has one known limitation: in the case of the JMS technology connector, security must not be turned on. All of the other technology connectors cannot be made highly available at this time due to a limitation in the Technology Connector Framework. We'll address this issue later. If there is a case where you need high availability for the other technology connectors, it is recommended that you use some form of integration bus, such as WebSphere Message Broker or WebSphere Process Server, instead of the associated technology connector.

MQ clustering will provide load balancing for all of the inbound events, and the JMS adapter and queue manager will fail over as a pair to the passive node in the availability cluster. However, using MQ clustering will cause multiple hops within the Business Events server cluster for contextually based events, such as a context object or a contextual summary intermediate object, since these events are associated with a single Business Events node. Consequently, you should expect some additional latency with these types of objects.


When you require a more robust high availability environment

When you require a more robust high availability environment, you should configure a secondary idle container on each Application Server instance to act as the failover target. When the event is received into the Business Events engine, additional load balancing will occur since this is part of the base functionality of Business Events clustering. Set up the steps table to use the ObjectGrid for replication; the table should be backed up by a datastore such as DB2® or Oracle®. With this configuration, the data will be replicated across all nodes in the cluster. When durable events are used, this context data is then used to recover from the point of failure. These replicas are not active in the cluster until the contextual node for an event is no longer available. At this point, ObjectGrid assigns a new context handler for the event, and processing continues. Only one context handler is available at any one time in the cluster for a particular event. This is what causes the multiple hops mentioned above.

Figure 3. Robust high availability architecture

When you're planning your high availability event infrastructure, keep in mind that only durable events will be preserved across a failover event. Once an event has been consumed off the queue, it is not rolled back.
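
The planning rule above can be expressed as a simple filter over the events that are still queued when a node fails. The following is a simplified model, not the Business Events API: the `Event` class and the event names are hypothetical, and events already consumed from the queue are assumed to be outside the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a queued event and its durability setting."""
    name: str
    durable: bool


def events_after_failover(queued_events):
    """Return the queued events that remain available to the node taking over.

    Only durable events are preserved; non-durable events are lost.
    """
    return [event for event in queued_events if event.durable]


queued = [
    Event("stock-tick", durable=False),
    Event("payment-received", durable=True),
    Event("page-view", durable=False),
]

survivors = events_after_failover(queued)
print([event.name for event in survivors])
```

A check like this can help when deciding which event types justify the cost of being defined as durable.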


High availability architecture

The following configuration sets up an instance for each Business Events node running under an Application Server ND cluster, along with a paired passive machine for each node of the cluster. This separation allows one node to undergo standard maintenance while the other node continues processing. Additionally, the pairing of the failover node allows the node and all of its resources, including MQ, to fail over as a single unit to the passive machine. The MQ queues, which are used for inbound and outgoing messages, will be stored on a common storage area network (SAN), and standard high availability procedures can be used for failover. Figure 4 shows a common platform layout for a production region.

Note: Because the event connectors are not used in this configuration, the inbound JMS traffic is limited to a single queue for either durable or non-durable events.

Figure 4. Architecture using no Tech Connectors

Note: MQ security cannot be enabled because the technology connectors do not currently support it in a way that provides high availability. Solution messages could potentially become stranded if MQ and the technology connector are not failed over as a pair with the logs and queues stored on a storage area network drive for failover. The diagram below shows how this architecture would work.


Configuring using technology connectors

The following configuration sets up a machine for each Business Events node running under an Application Server ND cluster, along with a single passive machine. This separation allows one node to undergo standard maintenance while the other node continues processing. When an event occurs, the passive machine is started, and its processing node joins the MQ cluster and begins to receive events. This enables additional processing to be performed on an as-needed basis by starting and stopping the instance, and allowing it to join the cluster.

Another variation on the above design is to place MQ and the technology connectors in a high availability server group, and have them fail over as a pair to the passive box. This requires configuring the technology connector to publish to a Business Events topic that is not necessarily on the same machine. It also requires that MQ place its queues and logs on a storage area network, much like the primary solution shown for production.

Note: MQ security cannot be enabled because the technology connectors do not currently support it in a way that provides high availability. Solution messages could potentially become stranded if MQ and the technology connector are not failed over as a pair with the logs and queues stored on a storage area network drive for failover.

Figure 5. Technology connector high availability

Business Events JMS settings of interest

Table 4. WebSphere ND JMS settings for Business Events
Topics | Name | Change to static queue
Action topic | jms/actionTopic | Yes, can be a static queue
Durable action topic | jms/durableActionTopic | Yes, can be a static queue
Durable event topic | jms/durableEventTopic | Yes, can be a static queue
Event topic | jms/eventTopic | Yes, can be a static queue

You can find these queue settings in the JMS configuration dialog in the Business Events administration console. As shown in Table 4, all of these JMS entries can be either pub-sub or static queue. By default, they are pub-sub. If technology connectors are not used, you'll need to manually configure these JMS entries to use static queues instead. Also, note that all messages need to be in the Business Events V6.2 format, as described in the WebSphere Business Events V6.2 Information Center. You are also limited to two incoming and two outgoing queues, one durable and one non-durable in each direction. If you wish to use these queues directly, it is recommended that you place either WebSphere Message Broker or WebSphere Process Server in your architecture to handle the transformation and routing of events into and out of the Business Events server.


Summary

This article demonstrated how you can leverage Business Events in a high availability environment by using the built-in functionality of Application Server, along with additional settings and configuration options, to enable a highly available enterprise event infrastructure.

Zone=Business process management, WebSphere
ArticleID=495724
ArticleTitle=Building a highly available WebSphere Business Events event infrastructure
publish-date=06182010