WebSphere Business Integration Message Broker and high availability environments

Andrew Humphreys (andrew_humphreys@uk.ibm.com), Knowledge Transfer Leader, Software Services for WebSphere, IBM Hursley United Kingdom
Andrew Humphreys is a Business Integration specialist working as a consultant for IBM Software Services for WebSphere. He has extensive experience in architecting WebSphere Business Integration Message Broker solutions and has used these skills at customer sites around the world. He is recognized by IBM customers, the IBM services community and the Hursley development laboratory as a worldwide expert on WebSphere Integration Message Broker and WebSphere MQ solutions.

Summary:  Project architects need to consider the importance of how to include WebSphere® Business Integration Message Broker in a highly available configuration. This article describes how to achieve high availability through a combination of application design and use of software-based clustering of queue managers using MQ clustering and machine clustering provided at the hardware level.

Date:  10 Mar 2004
Level:  Intermediate


Introduction

The use of any software product in a business-critical or mission-critical environment requires consideration of availability, which is a measure of the ability of a system to actually do what it is supposed to do, even in the presence of crashes, equipment failures and environmental mishaps.

It is therefore important to consider how to ensure that the performance and availability targets for a WebSphere Business Integration Message Broker V5 implementation are met. The most common implementation of Message Broker is as a central hub through which all messages in a messaging architecture are processed, which makes it a potential bottleneck and single point of failure for the whole WebSphere MQ Queue Manager network.

This article outlines some of the options for implementing WebSphere Business Integration Message Broker V5 in a highly available environment.


Availability considerations

Many issues contribute to high availability of a Message Broker environment. For example, consider the following list:

  • Reliable hardware
  • Shared queues
  • Health monitoring
  • Failover clustering
  • Online backup
  • Dual networks
  • Reliable operating system
  • Online reconfiguration
  • WebSphere MQ clustering
  • Fast reboot
  • RAID disks
  • Crash recovery
  • Fast start up
  • IP takeover
  • Documented procedures
  • Practicing procedures

All of these factors are important when considering availability.

A common target for availability is 99.999 percent yearly availability. Consider the formal definition of availability to be:

Percent_availability = Up_Time / (Up_Time + Down_Time) * 100

On this basis, to achieve 99.999 percent yearly availability on any given system, that system can have at most 5.26 minutes of downtime per year.

Down_Time can be split into:

Down_Time = Scheduled_Down_Time + Unscheduled_Down_Time

It follows that there is a limit to the levels of availability that can be achieved by avoiding unscheduled outages. A system that never crashes and is taken down for scheduled maintenance for 30 minutes each quarter (2 hours down time over a total of 8,766 hours per year) will have a yearly availability of:

8,764 hours / 8,766 hours * 100 = 99.977%
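
The arithmetic above can be checked with a short script. This is only a sketch of the two formulas in the text; the function names are illustrative and the 8,766-hour year is the figure used above.

```python
HOURS_PER_YEAR = 8766  # average year length used in the text (365.25 days)

def percent_availability(up_time, down_time):
    """Percent_availability = Up_Time / (Up_Time + Down_Time) * 100."""
    return up_time / (up_time + down_time) * 100

def allowed_downtime_minutes(target_percent, hours_per_year=HOURS_PER_YEAR):
    """Maximum yearly downtime, in minutes, that still meets the target."""
    return (1 - target_percent / 100) * hours_per_year * 60

# Four 30-minute maintenance windows per year and no unscheduled outages.
scheduled_down_hours = 4 * 0.5
quarterly = percent_availability(HOURS_PER_YEAR - scheduled_down_hours,
                                 scheduled_down_hours)

print(f"{quarterly:.3f}%")                                 # availability with quarterly maintenance
print(f"{allowed_downtime_minutes(99.999):.2f} minutes")   # downtime budget for "five nines"
```

Scheduled maintenance alone uses 120 minutes per year in this scenario, far more than the downtime that a 99.999 percent target allows.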

In other words, 99.999% availability is a very aggressive target, and generally requires the setting up of a separate project to analyze:

  • Hardware and software configurations. Consider disk mirroring, server redundancy, and uninterruptible power supply.
  • Application design. Developers must design applications that support non-disruptive release upgrades.
  • Data center organization. Consider strict change control, comprehensive testing, and operations support.

Consider some of the server-related failures that might occur. They include failures of disks, processors, memory, power supplies (or of the power itself), and the network. There are many other possibilities.

One approach to mitigating server-related failures is to install multiple instances of the components that might fail. For example, disks can be mirrored across different disk controllers, and multiple CPUs can be installed (although in an SMP machine there is still only one system memory). Multiple or backup power supplies and UPSs can be installed, and multiple network cards and network connections can connect through different routers.

In addition to the previous hazards, there are other reasons why a server might fail -- including, for example, errors in the software it runs or operational errors introduced by inappropriate use. In safety-critical applications it is common to run multiple different instances of software, built using different compilers, on different processor architectures and even using different algorithms. This is not generally feasible for commercial uses.

The best advice for commercial users is to restrict use of production servers to thoroughly tried, tested, and minimalist configurations of software and operational procedures. Even where built-in redundancy has been used, it cannot protect a server against environmental problems, which might affect a whole machine, room, or building.

Larger-scale protection/backup is needed -- this is the area of Disaster Recovery in which a remote site is run in parallel with the main production site. Operation is switched to the remote site if an environmental problem hits the production site.

A common issue of all the preceding failures (even the environmental ones) is that they pose a threat because the "system" contains single points of failure. Successful design reduces the number of single points of failure. Disaster Recovery provision can prevent one site from being a single point of failure.

Within a site, built-in hardware redundancy can eliminate some types of single point of failure within a server, but not all of them.

There are several approaches to maximizing availability, one of which is to build clusters of commodity computers that cooperate to run critical services. These clusters are generally referred to as High Availability (HA) Clusters. The use of clustered servers is similar to built-in redundancy, but deals with whole servers rather than internal components. Clustering servers provides a cost effective means of avoiding all hardware single points of failure at a site.


Messaging and availability

Availability in messaging software needs to address two types of messages, those that have already been queued ("existing messages") and those that have yet to be queued ("new messages").


Figure 1. Clustering options positioning

Figure 1 shows the degrees of availability achievable:

  • "Continuous" means uninterrupted access to messages.
  • "Automatic" means that availability can be restored without manual monitoring or intervention. This helps to minimize downtime.

The techniques are:

  • At the bottom are systems with no special consideration to high availability.
  • Next are failover clusters, commonly referred to as shared disk clusters (examples include HACMP or MSCS).
  • Overlapping with failover clusters are WebSphere MQ Clusters.
  • At the top is shared queue support and fault tolerant hardware.

The point to note from the diagram is that, on the platforms on which Message Broker is available, only a combination of distributed clustering (that is, MQ software clustering) and restart/failover clustering (that is, hardware clustering) provides HA support for all messages.


Message Broker highly available configurations

High availability and throughput of Message Broker hubs can be achieved by a combination of two distinct technologies:

  • Software clustering to provide load balancing and high-service availability across the whole hub (by allowing individual servers within a hub to become unavailable while the other servers continue to operate and service requests to the hub).
  • Hardware clustering (for example, HACMP) to provide high availability of a single server within a Message Broker hub.

It is recommended that you consider using both of these technologies in your messaging hubs to achieve high throughput and availability. By combining WebSphere MQ Clusters with HA (shared disk) Server Clusters, the queue manager(s) that a failed node was running can be failed over to a healthy node and restarted, making their messages available again much sooner than would otherwise be the case. Horizontal scaling across the cluster can also maximize hub throughput.

Software clustering

Consider a network topology built around a "logical hub" that is made up of multiple physical WebSphere MQ Queue Managers, serving multiple brokers in Message Broker, to spread the workload and so improve performance and availability.

The logical hub should consist of one or more servers running Message Broker brokers. Each server in the logical hub has one or more WebSphere MQ Queue Managers, with a one-to-one mapping between Queue Managers and Brokers on that server. There are two ways to deploy the brokers. Each broker in the logical hub can have identical execution groups and message flows deployed, so that each broker can process any type of message. Alternatively, each broker can be deployed with different message flows, so that work is spread throughout the logical hub. For example, messages that must be processed quickly go to a broker running on a fast server, while messages that can tolerate a slower response go to a slower server.

The Queue Managers within the logical hub should be clustered (using WebSphere MQ clustering) to provide high availability and load balancing across each server in the hub. Applications outside of the cluster will put messages only onto logical WebSphere MQ Integrator input queues. WebSphere MQ clustering will propagate messages to each WebSphere MQ Integrator Broker's instance of the input queue using the WebSphere MQ clustering load balancing and availability algorithm.

WebSphere MQ clustering

WebSphere MQ Clusters simplify administration and permit workload balancing (of message traffic). They also provide higher availability for new messages, since these can be put onto other instances of a cluster queue.

WebSphere MQ Clustering should be used to implement horizontal scalability and workload distribution. Figure 2 shows how a WebSphere MQ Queue Manager Cluster can be organized to provide workload distribution.

In Figure 2, APP1 and APP2 are two concurrently running instances of the same application. B1 and B2 are two concurrently running Brokers, configured identically. Messages from application instance APP1 are sent to the hub, addressed to the queue "BROKER.REQUESTQ". This Queue has been defined both on Queue Manager QMB1 and on Queue Manager QMB2. Both of these queue managers are members of the same MQ Cluster. Note that it is the Queue Managers (QMB1 and QMB2) that are in the cluster, not the brokers themselves. Brokers cannot be added directly to MQ Clusters, so you must add the Queue Managers that the brokers use.

In sending its request message, APP1 is unaware of this, having put the message to its local Queue Manager using only the Queue name as the destination. As this queue is not defined locally on QMA1, the Queue Manager checks its cluster definitions to see if the queue is defined as a cluster queue. It finds two definitions of the queue, one at QMB1 and one at QMB2. WebSphere MQ determines at execution time which of the two instances of BROKER.REQUESTQ will receive this message, distributing messages to each queue in round-robin fashion.

In the diagram, two request messages are sent and WebSphere MQ clustering sends one to QMB1 and one to QMB2. APP1 specifies MY.REPLYQ as the Queue on which it expects the reply message. This can be, but does not need to be, a cluster Queue. Note, however, that in this instance the reply message must be sent to MY.REPLYQ on Queue Manager QMA1. This should be managed in the Broker through retention and use of the ReplyToQ and ReplyToQMgr fields placed in the MQMD by APP1.
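
The round-robin behavior described above can be modeled in a few lines. This is a deliberately simplified sketch, not WebSphere MQ's actual workload management algorithm, which takes more factors into account; the function name and the `available` parameter are assumptions made for illustration.

```python
from itertools import cycle

def distribute_round_robin(messages, queue_managers, available=None):
    """Assign each message to the next available queue manager in turn.

    Simplified model only: it rotates over the queue managers that host
    an instance of the cluster queue and skips any that are unavailable.
    """
    available = set(queue_managers) if available is None else set(available)
    candidates = [qm for qm in queue_managers if qm in available]
    if not candidates:
        raise RuntimeError("no available instance of the cluster queue")
    rotation = cycle(candidates)
    return [(msg, next(rotation)) for msg in messages]

hosts = ["QMB1", "QMB2"]  # queue managers hosting BROKER.REQUESTQ

# Both instances available: the two requests alternate between them.
routes = distribute_round_robin(["request-1", "request-2"], hosts)
print(routes)

# QMB1 unavailable: new messages all go to the remaining instance.
degraded = distribute_round_robin(["request-3", "request-4"], hosts,
                                  available={"QMB2"})
print(degraded)
```

The second call illustrates why clustering helps new messages in particular: they are simply routed to a surviving instance of the queue.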

Having multiple instances of a reply queue in a Cluster could lead to problems if the ReplyToQMgr is not correctly specified in the reply message. In this case, the reply message expected by APP1 might find itself on the instance of MY.REPLYQ on Queue Manager QMA2. This is a problem only if APP1 and APP2 are not designed to handle reply messages for requests that originated from another instance of the same application or adapter.
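
As a rough illustration of the reply-routing rule, the sketch below models a request and derives the reply destination from it. The `MQMD` and `Message` classes are hypothetical stand-ins that carry only the two fields named in the text; they are not the WebSphere MQ API, and the strict checks are an assumption of this example.

```python
from dataclasses import dataclass

@dataclass
class MQMD:
    """Hypothetical stand-in holding only the two reply fields discussed."""
    ReplyToQ: str = ""
    ReplyToQMgr: str = ""

@dataclass
class Message:
    md: MQMD
    body: str

def reply_destination(request):
    """Return the (queue manager, queue) pair that the reply must go to."""
    if not request.md.ReplyToQ:
        raise ValueError("request does not name a reply queue")
    if not request.md.ReplyToQMgr:
        # Without a queue manager name, a clustered reply queue could
        # resolve to any instance, for example MY.REPLYQ on QMA2.
        raise ValueError("request does not name a reply queue manager")
    return request.md.ReplyToQMgr, request.md.ReplyToQ

request = Message(MQMD(ReplyToQ="MY.REPLYQ", ReplyToQMgr="QMA1"),
                  "request from APP1")
destination = reply_destination(request)
print(destination)
```

Carrying both fields through the message flow is what lets the reply reach the instance of MY.REPLYQ that APP1 is actually reading.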

WebSphere MQ Clustering presents a number of potential problems for the MQ system administrator. For example, if a Queue Manager fails, then although any subsequent messages sent to the cluster will not be routed to the failed Queue Manager, any messages already sent to the Queue Manager, but not yet processed by the broker, will be marooned on the failed Queue Manager.

For WebSphere MQ clustering to load balance between queues on different Queue Managers, messages sent from outside the cluster need to be sent to a "gateway" Queue Manager. This creates a single point of failure for the whole logical hub, so the gateway Queue Manager would need to have very high availability. To overcome this, the Queue Managers that the messages originate on could be included in the cluster. Provided that they do not have queues defined with the same names as the Message Broker input queues defined on multiple Queue Managers in the cluster, load balancing will still be achieved.


Figure 2. Load balancing MQ cluster

In a correctly configured WebSphere MQ cluster, all queue managers that host a full repository for the cluster must be fully interconnected using locally defined cluster sender channels. When a WebSphere MQ queue manager that hosts a full repository for a cluster receives an update, it is propagated to the other full repositories by sending it along all the locally defined cluster sender channels. It is NOT propagated along the auto-defined cluster sender channels.
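
The full-interconnection rule lends itself to a simple check. The sketch below is illustrative only: it assumes you have listed, by hand, which queue managers host full repositories and which cluster sender channels have been defined locally, and it makes no calls to WebSphere MQ itself. The choice of QMB1 and QMB2 as full repositories is an assumption for the example.

```python
from itertools import permutations

def missing_repository_channels(full_repositories, manual_channels):
    """Return the (from, to) pairs that lack a locally defined channel.

    Every full repository needs a manually defined cluster sender
    channel to every other full repository.
    """
    required = set(permutations(full_repositories, 2))
    return sorted(required - set(manual_channels))

repos = ["QMB1", "QMB2"]        # assumed full repository hosts
channels = {("QMB1", "QMB2")}   # QMB2 -> QMB1 has not been defined yet

missing = missing_repository_channels(repos, channels)
print(missing)
```

An empty result means every full repository can send updates directly to every other one.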

Unless a cluster is very large there is no good reason to have more than two full repositories in a cluster. It complicates setup and has no beneficial effect.

MQ Clustering and WebSphere MQ application design

When designing WebSphere MQ applications, it is important to avoid message affinities as much as possible. Message affinities arise when multiple messages are required to make up a single business transaction or where messages must be processed in a specific order.

They should be avoided as they prevent scaling through parallelism. MQ and WMQI scale well horizontally through the use of clustering. However, message affinities prevent successful scaling because they require messages to be processed through the same thread, broker, and so on, which creates a bottleneck. Although it is possible to tune performance on a single instance of a flow or a broker, there will always be physical limits to the throughput that can be achieved. Avoiding affinities lets applications run across multiple nodes and so scale better.

For example, consider a stock management system that issues several records that make up a single order: a header record, several individual line item records, and a trailer record. It may seem that the simplest approach is to send each record as a single message. However, this would be a poor design because it creates affinities between the messages; that is, each message requires the others to make up a single business transaction. This could have implications for WMQI if, for example, the routing information that determines where WMQI will send the order is contained in just the header. WMQI would then need logic to combine all the messages to ensure that they were all routed correctly. A better design would be to send all the order records as a single message.
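
A minimal sketch of the single-message design follows. The record layout, the field names, and the choice of JSON are all assumptions made for illustration; the article does not prescribe a message format.

```python
import json

def build_order_message(header, line_items, trailer):
    """Combine all records of one order into a single message body."""
    if trailer["line_count"] != len(line_items):
        raise ValueError("trailer line count does not match line items")
    return json.dumps({"header": header,
                       "lines": line_items,
                       "trailer": trailer})

# Hypothetical order records.
header = {"order_id": "A1001", "destination": "WAREHOUSE.NORTH"}
lines = [{"sku": "X1", "qty": 2}, {"sku": "Y7", "qty": 1}]
trailer = {"line_count": 2}

message = build_order_message(header, lines, trailer)

# The routing information travels with the line items, so any broker
# instance can route the whole order from this one message.
print(json.loads(message)["header"]["destination"])
```

Because the order is self-contained, it no longer matters which broker in the logical hub receives it.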

Another example of a message affinity that could affect WMQI is a dependency on the order in which the system processes messages. For example, consider a system that processes new customers and adds them to a customer database. The design may require that all new customers are processed single-threaded, to ensure that new customer numbers are allocated correctly and to avoid creating duplicate customer records in the database.

Hardware clustering

At its most basic level, hardware clustering will automatically switch over from a failing server to another server, thus minimizing unscheduled down time. Figure 3 depicts a typical hardware clustering configuration.

The figure shows a pair of clustered servers that both have access to a shared disk enclosure, containing multiple physical disks.

Such a configuration is frequently referred to as a "shared disk cluster", but it is actually a shared-nothing architecture as no disk is accessed by more than one node at a time. In addition, it is usual to have multiple power supplies and networks.


Figure 3. Clustered servers

The physical disks are generally organized into RAID-1 or RAID-5 logical volumes and each server will have multiple disk controllers, which connect to different portions (e.g. sub-mirrors) of the RAID array.

In a server cluster, a failure that affects one node can cause the "failover" of work that the node was running, onto another (healthy) server. The figure shows an application instance (in this case a queue manager called QM1) being failed over from the left node to the right node.

The cluster software detects that there is a problem with the left node -- this could be a hardware or non-hardware problem. The standby machine will:

  • Take over the IP address.
  • Take over the shared disks.
  • Start the queue manager and associated processes (channels, listener, trigger monitors).
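
The takeover steps can be pictured as an ordered sequence. The sketch below is a toy model, not HACMP or MSCS configuration; the class, the IP address, and the disk name are hypothetical, and the steps simply follow the order of the list above.

```python
class StandbyNode:
    """Toy model of the standby machine in a two-node failover cluster."""

    def __init__(self, name):
        self.name = name
        self.actions = []

    def take_over(self, service_ip, shared_disks, queue_manager):
        """Record the takeover steps in the order listed in the text."""
        self.actions.append(f"acquire IP {service_ip}")
        for disk in shared_disks:
            self.actions.append(f"mount {disk}")
        # The queue manager is started only after its disks are mounted.
        self.actions.append(f"start queue manager {queue_manager}")
        for process in ("channels", "listener", "trigger monitors"):
            self.actions.append(f"start {process} for {queue_manager}")
        return self.actions

standby = StandbyNode("right-node")
steps = standby.take_over("192.0.2.10", ["/dev/shared_qm1"], "QM1")
for step in steps:
    print(step)
```

Because the service IP address moves with the queue manager, client channels can reconnect to the same address after failover.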

This is commonly known as a cold-standby configuration; only one node is actively running workload at a time. This is a simple but relatively expensive configuration. Another option is to deploy an active/active configuration, also known as a mutual takeover configuration, in which multiple nodes run workload simultaneously.

In the context of WebSphere MQ, the queue manager specific directories under /var/mqm are put on shared disks, so that different nodes can each own and run a different queue manager at the same time. This makes it possible for multiple nodes to be performing WebSphere MQ work simultaneously. An active/active configuration has higher server utilization than a cold-standby configuration. On some platforms, this is a slightly more complex configuration to set up, but the setup is automated by scripts in WebSphere MQ HA Support Packs such as MC63 and IC61.

To minimize the time of an outage, the operators must be trained to perform the switch over reliably and efficiently. It is strongly recommended that you run switch over rehearsals at regular (monthly or bi-monthly) intervals. This ensures that operators know the procedures. It also allows early detection of configuration differences between production and system test.


Conclusion

This article discussed the importance of considering how to include WebSphere Business Integration Message Broker in a highly available configuration.

It also looked at how high availability can be achieved through a combination of application design, software-based clustering of queue managers using MQ clustering, and machine clustering provided at the hardware level.

High availability is not something that can just be switched on and off. It is important to understand that if very high levels of availability are required from the Message Broker, then the project architect must consider the availability and throughput requirements of the application, and plan and test extensively to ensure this is achievable.


