IBM WebSphere Business Events (hereafter called Business Events) helps you detect, evaluate, and respond to the impact of events based on the discovery of actionable event patterns. It enables you to define and manage events, so that you can take timely actions. This results in reducing the total cost of ownership through codeless implementation. To ensure that a Business Events deployment is highly available, your implementation should leverage the existing high availability features of WebSphere Application Server Network Deployment (hereafter called Application Server ND).
Currently, additional configuration is required to leverage this Application Server functionality in WebSphere Business Events. In this article, you'll learn how you can develop a highly available architecture using the inherent Application Server ND functionality along with specific application settings in WebSphere Business Events and WebSphere MQ (hereafter called MQ), to provide high availability.
The term "high availability" has many definitions, but it most often refers to a system that is available most of the time, that is, one with minimal downtime, such as the scheduled downtime required for upgrades or maintenance. Availability is measured as the percentage of uptime relative to total time. High availability means this percentage is greater than 99.x, where x represents between one and three digits of precision (for example, 99.999 percent availability is called "five nines availability").
The terms availability and high availability can have very different meanings depending on the audience. These terms often describe a variety of business goals and technical requirements, from hardware-only availability targets to mission-critical targets.
An available system is one that comprises a set of system-wide, shared resources that cooperate to provide essential services. High availability systems combine software with industry-standard hardware to minimize downtime by quickly restoring services when a system, component, or application fails. While not instantaneous, services are restored quickly, often in less than a minute.
A high availability solution guarantees that a system will automatically recover in the event of a software or hardware failure. The goal of achieving high availability is to eliminate scheduled downtime and minimize unscheduled downtime. Often, organizations have unrealistic expectations regarding availability targets, and they might demand higher levels of availability than they are actually willing to pay for. Once they understand the total cost of implementation, many organizations revise their requirements. Implementing a high availability solution includes, but is not limited to, the following costs:
- Network infrastructure
As downtime approaches 0, high availability approaches 100%. Downtime includes both planned and unplanned downtime.
Table 1. Downtime averages
|Percentage||Weekly downtime (hours)||Weekly downtime (minutes)||Yearly average downtime|
According to Gartner Research Note, ID Number: AV-13-9472, the causes of downtime and their associated probabilities are as follows:
Table 2. Downtime causes and probability
|Causes of downtime||Probability|
|Software failures (unplanned)||40%|
|Hardware failures (unplanned)||10%|
|Human errors (unplanned)||15%|
|Environmental problems (unplanned)||5%|
|Planned downtime||30%|
As seen in Table 2, there are two categories of downtime: unplanned and planned. Environmental and hardware failures are the least likely to occur, whereas software failures and planned downtime together contribute up to 70% of system downtime.
Based on Table 2 and the desired high availability of 99.99%, the allowable downtime for the various causes is shown in Table 3:
Table 3. Allowable downtime for various causes
|Causes of Downtime||Downtime in a year|
|Software failures (unplanned)||21 minutes|
|Hardware failures (unplanned)||5 minutes|
|Human errors (unplanned)||8 minutes|
|Environmental problems (unplanned)||3 minutes|
|Planned downtime||16 minutes|
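The downtime budgets in Table 3 follow directly from the availability target. As a quick illustrative sketch (the function name is my own):

```python
# Yearly downtime budget implied by an availability target.
def yearly_downtime_minutes(availability_pct, minutes_per_year=365 * 24 * 60):
    """Return the minutes of downtime per year allowed by the target."""
    return (1 - availability_pct / 100.0) * minutes_per_year

# 99.99% ("four nines") allows roughly 52.6 minutes of downtime per year,
# which Table 3 then apportions across the individual downtime causes.
print(round(yearly_downtime_minutes(99.99), 1))   # about 52.6
print(round(yearly_downtime_minutes(99.999), 1))  # about 5.3 ("five nines")
```

Multiplying that 52.6-minute budget by the probabilities in Table 2 yields the per-cause allowances in Table 3 (for example, 40% of 52.6 minutes is about 21 minutes for unplanned software failures).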
You can minimize hardware failure recovery times by doing the following:
- Addressing hardware redundancy: Hardware redundancy includes items such as redundant routers, servers, disks, and power supplies. Redundant hardware is not limited to only the physical machines that the architecture is running on, but also the associated infrastructure such as power and cooling facilities.
- Providing automated detection of failures: To provide true hardware recovery, you need some form of automated detection of failures. This can be provided through software such as High Availability Cluster Multi-Processing (HACMP), Application Server ND, and various other high availability solutions depending upon the platform being implemented and the solution's complexity.
- Using hot swap hardware facilities: Hardware recovery time can be minimized by using hot swap hardware facilities such as disks, network cards, and so on.
- Providing a stable and secure environment for the physical location of the system: Environmentally-related problems can be minimized using a stable and secure environment for the physical location of the system. For example, if the server is beneath someone's desk, the chance of failure is greater than if it's in a secure location.
You can minimize human errors by doing the following:
- Providing a clear system management interface: One of the biggest hindrances for high availability related to human errors is the lack of a clear system management interface. If your support personnel have to go to multiple consoles to manage a process or detect an error condition, the instances of human error increase.
- Providing a clear system operation procedure: The second issue that causes additional downtime in a failure condition is the lack of a clear system operation procedure. This can be as complex as an enterprise-level disaster plan or as simple as a defined process for a particular application. In either case, you need to store these plans in a convenient place for retrieval. These documents can also be referred to as an operational procedure document. You can minimize planned and unplanned downtime through the use of a well-defined and clear systems operation procedure.
Other ways to improve availability include:
- Addressing software redundancy
- Providing automatic detection of software failure
- Providing automatic recovery of software services
- Providing software-specific configuration
- Addressing specific application design
With Business Events V6.2 and V7.0, the ability to provide a high availability environment has dramatically improved from the initial product releases. With Business Events, IBM provides a completely scalable and available solution by exploiting native Application Server ND functionality. However, the technology connectors don't leverage this functionality, so they will need to rely on another method of providing high availability.
The Service Integration Bus (hereafter SiBus) is a logical entity that is
created and configured after installation of WebSphere Application Server,
through either the administrative console or the
wsadmin scripting tool. You can configure
the SiBus in two modes on an Application Server cluster, depending on the
requirements of the deployment. Because the SiBus is a logical entity
based on the physical implementation of a message-driven bean (MDB), there
is no inherent high availability or workload management functionality. You
need to independently configure the underlying physical implementation
prior to creation.
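As a sketch, the SiBus can be created and a cluster added as a bus member from the wsadmin scripting tool using Jython. The bus and cluster names below are illustrative placeholders; verify the AdminTask parameters against your Application Server ND version before use:

```python
# wsadmin Jython sketch -- run inside wsadmin, not as standalone Python.
# "WBEBus" and "WBECluster" are assumed, illustrative names.
AdminTask.createSIBus('[-bus WBEBus -busSecurity false]')

# Adding the cluster as a bus member creates the messaging engine.
# In the default high availability configuration, only one messaging
# engine in the cluster is active at a time.
AdminTask.addSIBusMember('[-bus WBEBus -cluster WBECluster]')

AdminConfig.save()
```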
The rest of this article will show a highly available Business Events infrastructure using WebSphere MQ as the external messaging bus. Again, keep in mind that technology connectors are not, by their nature, built to be highly available. However, they can be run in an active/passive mode to provide a higher level of availability for these feeder applications. Business events, by their nature, are transitory and usually not valid after some period of time. Therefore, by default, all events are non-persistent, and any event states that need to survive a failure should be set to be persistent and defined as durable events in Business Events.
The high availability cluster configuration is the default configuration
created when a cluster of application servers in a cell is created. When
the SiBus is created, there is only one active messaging engine on one of
the cluster servers, and all service requests to cluster members are
routed through this single messaging engine. Therefore, for a cluster of
n servers, there will be one local message
put action for routing the service request on
the server with the active messaging engine, and (n-1) remote
put actions for each of the servers
with inactive messaging engines.
The advantage of this configuration is that when there is a messaging
engine failure, another server becomes active and all remote
put actions are routed through the newly active
messaging engine. Also, because there is only a single server routing
messages to the target service, any message sequencing is persisted
through the bus.
The disadvantage of this configuration relates to performance, since there
is effectively a messaging bottleneck within the cluster configuration,
and (n-1) of all service requests are remote
puts to the active messaging engine.
By using MQ along with a high availability service application, such as HACMP, Microsoft™ Cluster Service, or Veritas Cluster Server, you can further enhance the availability of MQ queue managers feeding the Business Events engines along with the technology connectors.
A service set must contain all of the processes and resources needed to deliver a highly available service, and, ideally, should contain only these processes and resources. It is recommended that you use the queue manager and the JMS adapter as the unit of failover for MQ, since Application Server ND will handle the application failover using Application Server high availability. Consequently, for optimal configuration, you should place each queue manager in a separate package, together with the resources that it depends on. The service set should therefore contain the shared disks used by a queue manager, which should be in a volume group reserved exclusively for the package, the IP address used to connect to the queue manager (the package IP address), and an application server, which represents the queue manager.
A queue manager that will be used in a high availability cluster needs to have its logs and data on shared disks, so that they can be accessed by a surviving node in the event of a node failure. A node running a queue manager must also maintain a number of files on internal disks. These files include files that relate to all queue managers on the node, such as /var/mqm/mqs.ini, and queue manager-specific files that are used to generate internal control information. Files related to a queue manager are therefore divided between internal and shared disks. Use a single shared disk for all of the recovery data (logs and data) related to a queue manager.
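For example, assuming the shared volume is mounted at /MQHA (an illustrative path), the MQ control commands to create a queue manager with its recovery data on the shared disk might look like the following sketch:

```shell
# Illustrative sketch: create queue manager QM1 with its recovery data
# (queue data and logs) on an assumed shared mount point, /MQHA.
crtmqm -md /MQHA/data -ld /MQHA/logs QM1

# Start the queue manager on the active node; on failover, the HA
# software remounts /MQHA on the standby node and runs strmqm there.
strmqm QM1
```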
Figure 1 depicts a typical active/passive configuration. On each of the messaging servers, there is a set of application code that consists of the MQ server, the JMS adapter for Business Events, and the client connection package for Application Server. When a high availability event occurs, this resource group will fail over to the secondary machine and continue processing where it left off.
Figure 1. Typical active/passive configuration
Figure 2 depicts the high-level business events solution. It consists of an MQ cluster and a Business Events cluster. The Business Events cluster communicates with the MQ cluster through the use of JMS action and event points or the action queue and event queue. These queue types are each defined differently and each have limitations. Figure 2 shows the configuration that can be used for event streams.
Figure 2. Event stream configuration
The default configuration for a Business Events server is to use subscription points. This configuration has an inherent problem, as does any solution that subscribes to a topic: each subscriber receives a duplicate copy of every event. You can address this problem with a single gateway event engine, but this introduces either a single point of failure or an additional layer of complexity that might not be needed in all cases.
The second method of connectivity is to use technology connectors, which use parameters that are defined within the tools for the touchpoints. Through the use of point-to-point queues and having the event stream load balanced by MQ from the inbound side, you can ensure that there are no duplicate events being received by Business Events.
The third method is to configure Business Events to use a static queue for its events. The limitation of this method is that there is no segregation of events on the event server. All events come in on this single queue, are processed, and go to a single action queue. This configuration requires an external broker to route the action to its ultimate destination.
At this point, you should be aware that any in-flight non-persistent and non-durable events will be lost in the event of a failure. In the case of contextual data, this data should be persisted to the steps table and be available to the new Business Events node when it fails over. This configuration has one known limitation: in the case of the JMS technology connector, security must not be turned on. All of the other technology connectors cannot be made highly available at this time due to a limitation in the Technology Connector Framework. We'll address this issue later. If there is a case where you need high availability for the other technology connectors, it is recommended that you use some form of integration bus, such as WebSphere Message Broker or WebSphere Process Server, instead of the associated technology connector.
MQ clustering will provide load balancing for all of the inbound events, and the JMS adapter and queue manager will fail over as a pair to the passive node in the availability cluster. However, using MQ clustering will cause multiple hops within the Business Events server cluster for contextually based events, such as a context object or a contextual summary intermediate object, since these events are associated with a single Business Events node. Consequently, you should expect some additional latency with these types of objects.
When you require a more robust high availability environment, you should
configure a secondary idle container on each Application Server instance
for the rollover to fail over to. When the event is received into the Business
Events engine, additional load balancing will occur since this is part of
the base functionality of Business Events clustering. Set up the steps
table to use the
ObjectGrid for replication;
the table should be backed up by a datastore such as DB2® or
Oracle®. With this configuration, the data will be replicated across
all nodes in the cluster. This context data will then be used to recover
from the point of failure when durable events are used in the case of a
failure event. These replications are not active in the cluster until the
contextual node for an event is no longer available. At this point,
ObjectGrid assigns a new context handler for
the event, and processing continues. Only one context handler is available
at any one time in the cluster for a particular event. This is what causes
the multiple hops mentioned above.
Figure 3. Robust high availability architecture
When you're planning your high availability event infrastructure, keep in mind that only durable events will be preserved across a failover event. Once an event has been consumed off the queue, it is not rolled back.
The following configuration sets up an instance for each of the Business Events nodes running under an Application Server ND cluster, along with a paired passive machine for each node of the cluster. This separation allows one node to undergo standard maintenance while the other node continues processing. Additionally, the pairing of the failover node allows the node and all of its resources, including MQ, to fail over as a single unit to the passive machine. The MQ queues, which are used for inbound and outbound messages, will be stored on a common storage area network, and standard high availability procedures can be used for failover. Figure 4 shows a common platform layout for a production region.
Note: Because the event connectors are not used in this configuration, the inbound JMS traffic is limited to a single queue for either durable or non-durable events.
Figure 4. Architecture using no Tech Connectors
Note: MQ security cannot be enabled, since the technology connectors do not currently support this in a way that provides high availability. Solution messages could potentially become stranded if MQ and the technology connector are not failed over as a pair, with the logs and queues stored on a storage area network drive for failover. The diagram below shows how this architecture would work.
The following configuration sets up a machine for each of the Business Events nodes running under an Application Server ND cluster, along with a single passive machine. This separation allows one node to undergo standard maintenance while the other node continues processing. When a failover event occurs, the passive machine is started, and the processing node joins the MQ cluster and begins to receive events. This enables additional processing to be performed on an as-needed basis by starting and stopping the instance and allowing it to join the cluster.
Another variation on the above design is to place MQ and the technology connectors in a high availability server group, and have them fail over as a pair to the passive box. This requires configuring the technology connector to publish to a Business Events topic that is not necessarily on the same machine. It also requires that MQ place its queues and logs onto a storage area network, much like the primary solution shown for production.
Note: MQ security cannot be enabled, since the technology connectors do not currently support this in a way that provides high availability. Solution messages could potentially become stranded if MQ and the technology connector are not failed over as a pair, with the logs and queues stored on a storage area network drive for failover.
Figure 5. Technology connector high availability
Table 4. WebSphere ND JMS settings for Business Events.
You can find the following queue settings in the JMS configuration dialog in the Business Events administration console. As shown in Table 4, all of these JMS entries can be either pub-sub or static queue; by default, they are pub-sub. If technology connectors are not used, you'll need to manually configure these JMS entries to use static queues instead. Also note that all messages must be in the Business Events V6.2 format, as described in the WebSphere Business Events V6.2 Information Center. You are also limited to two incoming and two outgoing queues: one durable and one non-durable. If you wish to use these queues directly, it is recommended that you place either WebSphere Message Broker or WebSphere Process Server in your architecture to handle the transformation and routing of events into and out of the Business Events server.
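As a sketch, the corresponding MQ-backed JMS resources for a static-queue configuration could be defined through wsadmin using Jython. All resource names, JNDI bindings, and queue manager details below are illustrative assumptions; check the AdminTask WMQ command parameters for your Application Server version:

```python
# wsadmin Jython sketch -- run inside wsadmin, not as standalone Python.
# The cell name, resource names, and queue manager details are placeholders.
cell = AdminConfig.getid('/Cell:myCell/')

# Connection factory pointing at the highly available queue manager.
AdminTask.createWMQConnectionFactory(cell,
    '[-name WBECF -jndiName jms/WBECF -type CF '
    '-qmgrName QM1 -qmgrHostname mqhost.example.com -qmgrPortNumber 1414]')

# Static inbound event queue and outbound action queue.
AdminTask.createWMQQueue(cell,
    '[-name WBEEventQueue -jndiName jms/WBEEventQueue -queueName WBE.EVENT.QUEUE]')
AdminTask.createWMQQueue(cell,
    '[-name WBEActionQueue -jndiName jms/WBEActionQueue -queueName WBE.ACTION.QUEUE]')

AdminConfig.save()
```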
This article demonstrated how you can leverage Business Events in a high availability environment by using the built-in functionality of Application Server, along with additional settings and configuration options, to enable a highly available enterprise event infrastructure.
- WebSphere Business Events V7 Information Center: Get complete product documentation.
- developerWorks BPM zone: Get the latest technical resources on IBM BPM solutions, including downloads, demos, articles, tutorials, events, webcasts, and more.