Mission:Messaging: Migration, failover, and scaling in a WebSphere MQ cluster

Certain aspects of service orientation are best served using an IBM® WebSphere® MQ cluster. The cluster provides the location independence, run time resolution of names, and concurrency required by SOA applications. For these reasons, adoption of SOA is driving migrations from point-to-point messaging networks to clustered environments. This article looks at how migration, failover, and the scaling of queue managers are affected in an SOA context. This content is part of the IBM WebSphere Developer Technical Journal.


T.Rob Wyatt, Senior Managing Consultant, EMC

T.Rob Wyatt is a Senior Managing Consultant with IBM Software Services for WebSphere who assists customers with the administration, architecture, and security of WebSphere MQ. Recently he has focused on WebSphere MQ security, publishing in the IBM WebSphere Developer Technical Journal and presenting at the IMPACT and European Transaction and Messaging conferences. T.Rob also hosts The Deep Queue, a monthly WebSphere MQ security podcast.


12 November 2008


In each column, Mission: Messaging discusses topics designed to encourage you to re-examine your thinking about IBM® WebSphere® MQ, its role in your environment, and why you should pay attention to it on a regular basis.

Messaging impact on SOA

In the previous installment of Mission:Messaging, I wrote that evolving from point-to-point messaging architectures toward service orientation calls for updates to many of the long-standing best practices in the messaging world. Here, we will look at a case study to examine migration, failover, and scaling of queue managers, and the impact on naming conventions, tooling, administrative processes, and operations when these activities are considered in an SOA context.

First, a few terms:

  • Migration in this discussion includes any case of rehosting a queue manager, perhaps to refresh the underlying hardware or to move to a different platform. Migration will always involve building a new queue manager, a logical move of the application and queues to the new queue manager, and the eventual decommission of the old queue manager.

  • Failover is the planned or unplanned shutdown of the primary system and includes the accompanying task of bringing a standby node online to take over the processing load. The complementary action is to failback when the primary node is recovered. The classic example is a disaster recovery test, which involves failing over to a warm standby system, testing the applications, and then failing back to the primary systems.

  • Horizontal scaling is defined as changing the number of concurrent instances of an input queue in the cluster in order to increase or reduce processing capacity. Horizontal scaling to accommodate growth is usually permanent. Scaling to accommodate peak processing seasons is a cyclical process of first increasing, then reducing, capacity. The process of scaling up may or may not involve the build of a new queue manager, and usually does not involve any changes to the existing instances.

Certain aspects of service orientation are best served using an IBM WebSphere MQ cluster. The cluster provides the location independence, run time resolution of names, and concurrency required by SOA applications. For these reasons, adoption of SOA is driving migrations from point-to-point messaging networks to clustered environments.

The point-to-point paradigm

Most well-established WebSphere MQ shops will have suffered through at least one hardware migration and experienced enough growth to have required scaling up at some point, so it is likely that there are established procedures for these activities. Hopefully, all shops have a disaster recovery failover plan, even those new to WebSphere MQ. What do these procedures look like? In the point-to-point world, migration, failover, and scaling are usually distinct and very different from one another:

  • Migration: It is common for migrations to be scheduled and worked as a single event where all the tasks occur in an uninterrupted sequence, including the bulk of the building and configuring activities on the new target queue manager. In this way, the current state of the retiring queue manager is captured at cutover time and moved intact to the new host. Because it is a one-time event, there are few, if any, accommodations for fall back.

  • Failover: The objective here is to switch between two functionally equivalent queue managers. Although they are (or at least should be) different queue managers with distinct names, to the rest of the network the only apparent difference between the two is the CONNAME. Because failover always anticipates failback, it is usually worth the time to create automation or processes that promote consistent, reliable, and repeatable execution of the activities. Typically failover involves a buildout phase independent of the actual cutover execution.

  • Scaling: Upgrading capacity is a more common case of scaling than is accommodation of cyclical loads. As a result, most instances of scaling are planned and executed as a one-time event, similar to a migration. The main difference is that after the event, procedures are created to ensure that all instances of the queue manager remain in sync, and changes applied to one are applied to all.

Point-to-point implementation

In this architecture, the queue manager is the root context for object names, and the procedures for administration and operation reflect this orientation. Because changing queue manager names in a point-to-point network is disruptive, it is tempting to reuse the same names for queue managers and channels during failover and migration. This is actually an anti-pattern: one of those things that initially seems like the right idea but often later turns out to be a nightmare. Despite the problems with queue manager name reuse, most migration plans that I've seen depend on it. As a result, it is difficult or impossible to have both queue managers online at the same time, which affects scheduling and execution of the migration tasks.

In the case of scalability, where the primary goal is concurrency, reusing a queue manager name causes more problems than it solves, so there is usually no temptation to create duplicate named instances. In this case, a queue manager alias is typically used to achieve node equivalence, but the overall effect is still that the queue is resolved in the context of its queue manager.

Another aspect of the point-to-point architecture is that the run time configuration tends to remain fairly static. In fact, many of the processes and procedures assume stability in the configurations. For example, take the case of object definitions. In many shops, these are stored in mqsc scripts. The initial version of the script contains the local baseline for a queue manager, such as setting the dead letter queue, locking down remote administrative access, and tuning channels. Next, application-specific objects are added either in their own scripts or to the master script. As new queues, topics, and other objects are added, the scripts are rerun. This action redefines the existing objects in place and creates any new objects. A typical example looks like this:

Listing 1
DEFINE QLOCAL('APP.FUNCTION.SUBFUNCTION.QA') +
       DESCR('APP service queue for QA') +
       BOTHRESH(5) +
       CLUSTER('DIV_QA') +
       CLUSNL(' ') +
       REPLACE

The first time this definition is run, the local queue APP.FUNCTION.SUBFUNCTION.QA is created. The REPLACE option ensures that the definition will not generate an error on subsequent runs. The key here is that all network maintenance activities are performed at build time. The exception to this is failover. Because failover is designed into the system, the scripts to execute it and the complementary failback scripts are usually created ahead of time. Typically, these are also mqsc scripts, but instead of DEFINE statements, they consist of ALTER statements such as this:

Listing 2

One set of scripts would be created to failover and another to failback. Although these scripts execute at run time, it is worth noting that they use the same tooling as the build time activities. Note also that the routing of messages is accomplished by reconnecting the physical network. We'll come back to this later.
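As a rough illustration of the kind of ALTER-based failover script described here, the sketch below repoints a sender channel at a standby host. The channel name QM1.TO.QM2 and both host names are hypothetical and are not taken from the original listing; the stop and start steps are an assumption about how such a script would usually be sequenced so that the new CONNAME takes effect.

```
* Point-to-point failover sketch (hypothetical channel and host names).
* Run against the queue manager that owns the sender channel.
STOP CHANNEL('QM1.TO.QM2')
* Repoint the channel at the standby queue manager's listener.
ALTER CHANNEL('QM1.TO.QM2') CHLTYPE(SDR) +
      CONNAME('standby.example.com(1414)')
START CHANNEL('QM1.TO.QM2')
```

The matching failback script would be the same apart from restoring the primary host in CONNAME.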

At least two key assumptions that guided the evolution of WebSphere MQ best practices in the point-to-point era are no longer true in service-oriented architecture. There are many more, but two are relevant to the illustration here. These are:

  • the queue manager is the root for name resolution.
  • the object definitions are relatively static.

The result, as you saw above, was that nearly all operations were build time operations, changing queue manager names disrupted object name resolution, and making changes to message routing required reconfiguration of the physical network.

The SOA paradigm

Service orientation changes all of this in ways that make the messaging network more resilient, more transparent, and easier to administer. Or, at least, it can potentially change these aspects of the messaging network. To reap the benefits, it is necessary to update some of our best practices. Certainly, it is possible to design the applications for service orientation but deploy them onto a traditional messaging network, and this is, in fact, what happens much of the time.

Some of the service orientation features that are relevant to this discussion are:

  • Location independence: This means that a destination should be available from anywhere in the network. Or, to put it in the context of the earlier discussion, the cluster is the root container for the resolution of names rather than the queue manager. The role of the queue manager recedes to that of an anonymous container of queues and topics. This is true even when we consider reply-to queues. The queue manager name is required to fully qualify the address of a reply message, but it is determined at run time and is only valid for the life of the reply message. Which brings us to:

  • Run time name resolution: Rather than physically reconfigure the network to change routing, you can use the WebSphere MQ cluster functionality to achieve the same result. The basic topology is built from advertising various objects, such as queues and aliases, to the cluster. Then, day-to-day operational changes are performed by enabling or disabling those objects in the cluster, or by suspending and resuming queue managers.
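The suspend and resume operations mentioned above might look like the following sketch when issued through runmqsc on the queue manager being taken out of, or returned to, service. The cluster name DIV_QA is borrowed from the article's queue definition. Note that suspension is a request rather than a hard stop: the cluster can still route messages to a suspended queue manager when no other destination is available.

```
* Run time routing sketch using the article's example cluster name.
* Ask the cluster to stop sending work to this queue manager.
SUSPEND QMGR CLUSTER('DIV_QA')
* Check which cluster members are currently suspended.
DISPLAY CLUSQMGR(*) SUSPEND
* Return this queue manager to service.
RESUME QMGR CLUSTER('DIV_QA')
```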

The key here is that moving name and route resolution from the physical network to the WebSphere MQ cluster requires a separation of tasks and attributes that are applicable to build time from those that are applicable to run time. It requires our tools and processes to recognize that the network now has state and that the state must be accounted for when making changes. Retooling sounds like a lot of work. It really isn't, but before we look at that, let's see what it buys us.

Moving the name resolution up into the cluster means routing is no longer tied to the physical network connectivity. The cluster provides a virtual mesh network in which every node is connected to every other, and in which routing can be controlled through manipulation of objects in the cluster namespace. There is, of course, the benefit of not having to administer channels, but this is so much more than that. Controlling routing at the physical network layer embeds policy into topology. It is very inflexible, very limiting, and creates dependencies between otherwise unrelated applications. Controlling routing in the cluster enables any number of topologies to co-exist independently on the same physical network, and also enables us to align the cluster namespace with the service registry namespace. The applications providing and consuming services will resolve names in the service registry which, in turn, are resolved in the WebSphere MQ network. The more closely these namespaces are aligned, the less friction is encountered when name resolution moves vertically between them.

Going back to our use case, migration, failover, and scaling in an SOA environment are no longer very different processes, but are now minor variations of the same process. Failover becomes the template for all three processes. There is a build time task to create the standby environment and run time tasks to toggle between the primary and secondary nodes. Migration is simply a failover that never fails back, and scaling up is a failover in which the primary node never goes offline. Anyone running these processes now needs to learn only one basic task list with a few minor variations, which leads to more consistent results and fewer defects. In addition, it is now possible to failover individual applications or even single resources, whereas previously the unit of failover was an entire queue manager. Given the trend toward consolidation, this will enable many more applications to share a queue manager.

SOA implementation

I said earlier that the retooling need not be extensive. The main requirement is that your build time tools take into account the run time state of the system and don't arbitrarily reset it. You still want to keep the object definition scripts and you would still like to be able to run them at any time -- in their entirety -- to add new objects. A small change to the object definitions will accomplish these goals. All that is required is to separate out the build time object attributes from the run time attributes. For example, the run time attributes of a queue include whether triggering is turned on, and whether PUT and GET are currently enabled or disabled. Using the same queue as in the previous example, the object definition script now looks like this:

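Listing 3
* Run time attributes: applied only when the queue is first created.
DEFINE QLOCAL('APP.FUNCTION.SUBFUNCTION.QA') +
       GET(DISABLED) +
       PUT(DISABLED) +
       NOTRIGGER +
       NOREPLACE

* Build time attributes: reapplied on every run of the script.
ALTER QLOCAL('APP.FUNCTION.SUBFUNCTION.QA') +
       DESCR('APP service queue for QA') +
       DEFPSIST(NO) +
       BOTHRESH(5) +
       CLUSTER('DIV_QA') +
       CLUSNL(' ')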

All of the attributes that are considered run time are placed in a DEFINE statement with the NOREPLACE option. The first execution of the script creates the queue in the initial run time state. The queue (as defined in the example) is effectively hidden from the cluster because GET and PUT have been disabled. You can now use the administration tool of your choice to update the GET and PUT status of the queue without any danger that the script will reset them back to their initial state. Additional executions of the script simply skip over the run time attributes because of the NOREPLACE option. The next statement alters the queue with what are considered build time attributes. This statement is executed on every run of the script and if you need to change something permanently, that happens here.
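For instance, bringing the queue into service and then confirming that a later run of the build script left the run time state alone could be done with commands along these lines. This is a minimal sketch using the queue name from the article's example; any administration tool that sets the same attributes would serve equally well.

```
* Bring the queue into service at run time.
ALTER QLOCAL('APP.FUNCTION.SUBFUNCTION.QA') GET(ENABLED) PUT(ENABLED)
* After rerunning the definition script, check that the run time
* attributes still show the values set above.
DISPLAY QLOCAL('APP.FUNCTION.SUBFUNCTION.QA') GET PUT TRIGGER
```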

You saw earlier that moving name resolution up to the cluster level diminished the role of the queue manager. In the new paradigm, queue managers are merely life support for queues and topics. The actual queue manager name becomes much less important to the application or the service registry. In fact, if you continue to give the queue manager names that are meaningful to the application or to the service registry, they eventually become constraints, locking you into a particular topology or deployment pattern and robbing you of the flexibility you hoped to gain by adopting SOA. In the new model, names meaningful to applications or the registry (queue manager alias, queues, topics) move up into the logical layer and the remainder (queue managers, channels, clusters) are pushed down to the physical layer. If you’ve been around MQ long enough, you probably remember when the best practice was to name the queue managers after the host. We moved away from that over the years, but the implementation I am proposing brings us back full circle. In this model, name the queue manager for the host it resides on and use queue manager aliases to implement routes and destinations that are meaningful to the applications.
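A clustered queue manager alias of the kind described above can be sketched as a remote queue definition with a blank RNAME. The alias name APP.SERVICE.QA and the host-based queue manager name HOSTA are hypothetical, chosen only to show the split between a logical name that applications use and a physical name that follows the host.

```
* Queue manager alias sketch (hypothetical alias and queue manager names).
* Run on HOSTA, the queue manager named for the host it resides on.
DEFINE QREMOTE('APP.SERVICE.QA') +
       RNAME(' ') +
       RQMNAME('HOSTA') +
       XMITQ(' ') +
       CLUSTER('DIV_QA')
```

Applications would then address messages to APP.SERVICE.QA as the destination queue manager name, and moving the service to another host would mean advertising the same alias from a different queue manager rather than changing the applications.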

With name resolution delegated to the cluster and scripts enhanced, let's see what the use cases look like now:

  • Migration: This process now looks like a failover but without any expectation of failing back. There are now distinct build time and run time tasks. The new queue manager is built, brought online, and suspended from the cluster. The queues are defined and disabled in the cluster, either by disabling PUT and GET or by not setting the CLUSTER or CLUSNL attributes. Because the new queue manager can peacefully coexist with the existing one, it is possible to test much of the deployment well ahead of cutover. The application can be installed and the authorizations tested, for example. Cutover is executed just like the first part of a failover test. If the cutover is successful, the old node is decommissioned. If not, migration looks exactly like a failover test because the failback technique replaces the old style backout plan.

  • Failover: The basic tasks of failover remain the same: switch processing to the secondary node and, later, switch back. The difference is that instead of executing this by altering CONNAME attributes to reroute the physical network connections, the routing is accomplished using the facilities of the WebSphere MQ cluster. In most cases, this means enabling queues at the secondary node and disabling them on the primary. Alternatively, queue manager aliasing could be used to establish one or more queue managers as destinations. In this case, the failover would involve enabling the alias on the secondary nodes and disabling it on the primary. Failback in either case involves reversing the actions. Regardless of the model used, the operation could be implemented as scripts or through administrative tooling.

  • Scaling: Scaling up is a failover where the primary node remains online. Scaling down is like a failback operation without the task of bringing the primary node up.


Deconstructing our use cases, you can see that they are now all composed of a few common sub-tasks:

  • Build a queue manager and enroll it in the cluster.
  • Create routing entries at the logical layer.
  • Build queues and topics in their initial state.
  • Enable a node.
  • Disable a node.
  • Decommission a queue manager.
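The first sub-task, building a queue manager and enrolling it in the cluster, might be sketched as follows once the queue manager itself has been created and started. The channel names, host names, and ports are hypothetical; the cluster name DIV_QA comes from the article's queue definition. The final SUSPEND reflects the migration case described earlier, where the new node joins the cluster but stays out of the workload until cutover.

```
* Enrollment sketch (hypothetical channel names, hosts, and ports).
* Cluster-receiver: tells the cluster how to reach this queue manager.
DEFINE CHANNEL('DIV_QA.HOSTB') CHLTYPE(CLUSRCVR) TRPTYPE(TCP) +
       CONNAME('hostb.example.com(1414)') CLUSTER('DIV_QA')
* Cluster-sender: points at one full repository; its name must match
* the cluster-receiver channel defined on that repository.
DEFINE CHANNEL('DIV_QA.REPOS1') CHLTYPE(CLUSSDR) TRPTYPE(TCP) +
       CONNAME('repos1.example.com(1414)') CLUSTER('DIV_QA')
* Keep the new node out of the workload until it is ready to be enabled.
SUSPEND QMGR CLUSTER('DIV_QA')
```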

Migration runs through all of these tasks more or less in order. Failover uses the enable/disable tasks twice in succession. Scaling up uses enable tasks, scaling down uses disable tasks. If you wrote out each of these sub-tasks separately, creating a procedure manual for each of your use cases would simply be a matter of assembling the sub-tasks in the right order. Starting to sound familiar? Reuse, encapsulation, process assembly: by delegating name resolution to the cluster and enhancing your tooling so as not to disrupt the state of the cluster, you have SOA-enabled the messaging network.


Most WebSphere MQ best practice documents you find today will not advise you to tie the queue manager name to the physical network. If you do find such a document, chances are it will be ten or more years old. Most of the same documents will contain sample object definition scripts, but none of them will illustrate the DEFINE/ALTER technique I propose here. Does that make these documents wrong? No, they are perfectly valid in the context of the point-to-point messaging model that was prevalent when they were created. But if you attempt to implement a service oriented architecture onto a messaging network that is point-to-point at its core, the result will be, at best, something that is inflexible and difficult to administer and, at worst, fragile and unreliable.

To get the most out of service orientation, it needs to be implemented in the messaging layer and not just in the applications above. To do that requires reevaluating your established best practices to make sure that the underlying assumptions they are built on still hold true. In this example, you changed the assumption and practice of implementing message routing at the physical layer and moved it up into a logical layer where it is managed by the cluster. This meant that the cluster now has run time state, so you enhanced your script tooling to take state into account. These small changes enabled the final phase of the transformation, which was to decompose the processes of migration, failover, and scaling into atomic, reusable tasks that you then assembled into composite workflows.

The obvious result was that the three use cases came to resemble one another closely, and from that you can expect reduced training requirements and fewer defects due to human error. What is less obvious, however, is how these changes make the network so much more capable. With routing defined in a logical layer above the physical network, it is possible to have many overlapping topologies implemented on the same network fabric. This alone is worth the cost of retooling. Redistributing workload by moving one or more instances of an application becomes trivial. All of the queue managers -- primary, secondary, disaster recovery nodes -- can be online simultaneously. This separates the build and deploy tasks for new queue managers and enables early testing and verification of nodes. Because these techniques work equally well in the point-to-point model, they can be implemented on either type of network.

In the end, though, this is just one example of how new requirements and changing environments break existing processes and best practices. Keeping up to date will require an occasional re-examination of our MQ tooling and methods, and willingness to embrace change.




