Mission: Messaging: Embracing cultural change in the WebSphere MQ community

The WebSphere® MQ community has been using the same best practices for a long time, and with all of the change that has occurred in the industry, today's underlying processing model has drifted far from the model that existed when these best practices were developed. This difference represents significant potential risk, but we have an opportunity now, as a community, to close this gap and bring our best practices in line with the use cases currently employed. This content is part of the IBM WebSphere Developer Technical Journal.


T.Rob Wyatt, Senior Managing Consultant, IBM

T.Rob Wyatt is a Senior Managing Consultant with IBM Software Services for WebSphere who assists customers with the administration, architecture, and security of WebSphere MQ. Recently he has focused on WebSphere MQ security, publishing in the IBM WebSphere Developer Technical Journal and presenting at the IMPACT and European Transaction and Messaging conferences. T.Rob also hosts The Deep Queue, a monthly WebSphere MQ security podcast.



03 September 2008

In each column, Mission: Messaging discusses topics designed to encourage you to re-examine your thinking about IBM® WebSphere MQ, its role in your environment, and why you should pay attention to it on a regular basis.

The cultural heritage of WebSphere MQ

In the fifteen years that WebSphere MQ has been available, an ongoing dialog amongst the user community has produced a cultural heritage of common knowledge and best practices. This collective wisdom has accumulated in online forums, conference proceedings, technical journals, and the private document repositories of thousands of IT shops worldwide. It has been refined, polished, tweaked, and tuned over the years to the point that the body of knowledge is remarkably consistent and persistent.

This is a mixed blessing.

In the first Mission: Messaging column, I wrote that the accessibility of WebSphere MQ has, in many shops, led to less emphasis on formal training, and that the resulting skill gap often led to outages. The flip side of this, however, is that the same accessibility -- our cultural heritage of best practices and common wisdom -- lowers the barrier to entry into WebSphere MQ for new users and raises the overall quality of implementations. These benefits are a natural and powerful incentive for the WebSphere MQ community to cultivate this system of collective knowledge. The incentive is so strong, in fact, that the community will sometimes perpetuate practices that it no longer understands, that have little value or, in some cases, that are actually destructive anti-patterns. This illustrates the principle of cultural inertia -- the tendency of a meme at rest to remain at rest.


Difference in degree vs. difference in kind

The effect of cultural inertia over time is that we as a community are much better at adding to our body of knowledge than we are at updating it. When a new product feature or use case comes along, there is a tendency to find a parallel to some existing best practice and piggyback on top of it. If the result functions without obviously breaking anything, it becomes part of the cultural fabric, even though the new use case may break fundamental assumptions that the original best practice was founded on.

If the new use case truly is a superset or extension of the old use case, this process results in a sound and reliable new best practice. These kinds of incremental changes are differences in degree: A is like B, but a little more complex.

The culture reacts much differently, however, to differences in kind, where A is nothing like B and interacts with B in unexpected ways. These changes are not easily pigeonholed into existing categories, and so they force us to question the underlying assumptions that the current best practices are built on. They threaten to invalidate our architecture, our code, our operations manuals and, worst of all, our ability to substitute casual knowledge for deep skills. Differences in kind cost money. They require a business case. They find very few champions willing to campaign for them.

We do not do a very good job of adapting to these paradigm shifts. More often than not, they are absorbed into the culture disguised as incremental changes. This process introduces latent defects and vulnerabilities into our implementations, which accumulate in the form of growing potential risk. Then, when something breaks catastrophically, we wonder how and why it got that bad and why there was no warning. This is the cultural equivalent of rust. What were once "best practices" over time become merely "practices" and eventually they become anti-patterns -- practices that look good at first glance but are actually destructive.


Client vs. bindings connections

Digging back into WebSphere MQ ancient history, one example of a difference in kind was the introduction of the MQSeries client. On the queue manager side of the connection, the API calls are the same and the authorization mechanism is the same so the common practice is to treat a client application like a bindings mode application -- but with an extra channel definition. On the application side, it is possible to take a bindings mode application and run it in client mode without any changes, and in many cases, this is exactly what happened.

But the bindings and client modes are fundamentally different because, in client mode, a channel exists between the application and the queue manager. As administrators and developers, we want to think of this channel as a transparent connection to the queue manager, but it is not. When two queue managers exchange messages over a channel, both sides of the connection are managed by MCAs (message channel agents) that share a complex protocol which ensures that persistent messages are serialized, hardened to disk, and then acknowledged by the receiver before they are deleted from the sending side. The two MCA processes manage batches of messages and will automatically resynchronize after a failure, committing or backing out units of work as required to preserve the integrity of the data.

Contrast this with a client application, where one side of the connection is a message channel agent and the other is application code. To be as reliable as a queue manager-to-queue manager channel, the client application would have to duplicate the channel synchronization logic of the MCA in order to recover from broken connections. For example, if the connection were broken on a COMMIT, there are two possibilities: 1) the connection was lost before the COMMIT was received by the MCA, or 2) the MCA processed the COMMIT but was unable to transmit the response code back to the application.

In the first case, where the MCA never sees the COMMIT call, the transaction will eventually be rolled back. At this time, any messages PUT under syncpoint will be removed from the queue and any messages dequeued with destructive GET calls will be rolled back onto the queue and eventually redelivered to the application.

In the second case, where the MCA has acted on the COMMIT but could not deliver the response code, there is no transaction to roll back. Messages that were read from the queue with destructive GET calls are permanently removed from the queue and any PUT messages are delivered.

Compounding matters is the fact that pending transactions will remain under syncpoint for an indeterminate period while waiting for TCP to time out the socket. This interval might be measured in seconds or many minutes, depending on the TCP kernel settings. Compare this to a broken connection in bindings mode that is detected almost immediately by the queue manager, which then rolls back the transaction, typically within a few milliseconds. In both cases, the outcome of the transaction is ambiguous until the transaction is rolled back, but the duration of the ambiguity is milliseconds in one case and possibly many minutes in the other.
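
The size of this window is tunable to a degree. As a minimal sketch, assuming a hypothetical server-connection channel named APP.SVRCONN, lowering the channel heartbeat interval lets the queue manager notice a lost client in something closer to seconds than the TCP defaults allow:

    * Hypothetical channel; HBINT is in seconds (the default is 300) and is
    * negotiated with the client at connect time
    ALTER CHANNEL(APP.SVRCONN) CHLTYPE(SVRCONN) HBINT(30)

Heartbeats narrow the window, but they do not remove the ambiguity; the application still has to handle it.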

An application coded to recover from broken connections in bindings mode has a reasonable expectation of reconciling the state of the transaction immediately and continuing processing in an orderly fashion. The same application in client mode must account for the possibility of the transaction remaining outstanding for several minutes before any reconciliation can occur. If the transactions are time sensitive, one of these cases is acceptable and the other is not, yet much of the community treats the two as functionally equivalent.

Over time, some new best practices for client applications have emerged to deal with the ambiguous outcomes of broken connections. These include performing all API calls under syncpoint, coding the applications on either side of the interface to handle duplicate messages, and resending PUT messages after a broken connection. The situation is not unique to WebSphere MQ; in fact, it is addressed in Section 4.4.13 of the JMS specification, which states:

If a failure occurs between the time a client commits its work on a Session and the commit method returns, the client cannot determine if the transaction was committed or rolled back. The same ambiguity exists when a failure occurs between the non-transactional send of a PERSISTENT message and the return from the sending method.

It is up to a JMS application to deal with this ambiguity. In some cases, this may cause a client to produce functionally duplicate messages.

A message that is redelivered due to session recovery is not considered a duplicate message.

(from Java Message Service - Version 1.1. April 12, 2002)
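
To make the specification's advice concrete, here is a minimal sketch of a transacted JMS sender, assuming the receiving application keeps a record of duplicate-detection keys it has already processed. The class name, property name, and retry count are all hypothetical; this is one way to implement the resend-plus-duplicate-handling practice described above, not a prescribed pattern:

    import javax.jms.*;

    public class AmbiguousCommitSender {

        // cf and queue would come from JNDI in practice; all names are hypothetical.
        public void sendWithResend(ConnectionFactory cf, Queue queue,
                                   String payload, String dedupKey) throws JMSException {
            JMSException lastFailure = null;
            for (int attempt = 0; attempt < 3; attempt++) {
                Connection conn = null;
                try {
                    conn = cf.createConnection();
                    Session session = conn.createSession(true, Session.SESSION_TRANSACTED);
                    MessageProducer producer = session.createProducer(queue);
                    TextMessage msg = session.createTextMessage(payload);
                    // Application-level duplicate-detection key: the receiving side
                    // records keys already processed and silently discards repeats.
                    msg.setStringProperty("APP_DEDUP_KEY", dedupKey);
                    producer.send(msg);
                    session.commit();   // a failure on this call is ambiguous...
                    return;             // ...but a normal return is not: we are done
                } catch (JMSException broken) {
                    lastFailure = broken;   // resend the same message with the same key;
                                            // if the commit actually worked, the receiver
                                            // sees a duplicate and drops it
                } finally {
                    if (conn != null) {
                        try { conn.close(); } catch (JMSException ignored) { }
                    }
                }
            }
            throw lastFailure;          // still failing after the retries
        }
    }

Note that the provider assigns a fresh JMSMessageID on every send, so the duplicate-detection key must be application-assigned; as the specification says, a redelivered message is not a duplicate, which is exactly why the provider's own identifiers cannot serve this purpose.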

There is an opposing school of thought which holds that there is no difference between client and binding mode connections. The argument is that an MQRC 2009 CONNECTION_BROKEN response code from a bindings mode connection will have the same ambiguity of outcomes and that the application needs to handle these uniformly regardless of the connection mode. If this really is a difference in kind, as I am arguing, the right approach would be to design client and binding mode applications differently. On the other hand, if this is merely a difference in degree, then the right answer is to design client and bindings mode applications to account for the ambiguous outcome of messaging API calls, just as the JMS specification suggests. The problem is that either way, the prevailing practices get it wrong!

The issue of broken connections surfaced only after the introduction of the MQSeries client. By the time the issue came to light, "best" practices had been established based on an underlying assumption that the outcome of failed transactions would be immediately and reliably detectable. Even though the MQSeries client broke the use case on which the existing coding practices were built, and despite the subsequent development of competing methods which correctly model the underlying issue, the prevailing practices to this day reflect the original model in which no ambiguity of outcomes exists.


The bigger picture

The broken connection issue is just one example of how the WebSphere MQ culture tends to perpetuate established practices even after they are obsolete or demonstrably broken. I picked it because it illustrates how incumbent cultural memes persist in a broken state despite competition from better modeled and more robust methods with significant support in the user community.

There are many other examples, including these:

  • Backup strategy: Most of the best practice documents I have seen, including the IBM Redbook on the subject, recommend backing up the files under the queue manager. This practice models and extends the practice of backing up an application and its configuration details at the filesystem level, and it worked when applications mostly connected to MQSeries in bindings mode, resided on the same server as MQSeries, and everything was shut down for the backup. But with consolidation and virtualization, today's queue manager has no maintenance window and is shared among any number of applications, many of which are remote. It is never a good idea to back up the queue manager while it is running, although this happens more often than not and sometimes results in an unusable backup. Even if WebSphere MQ is stopped for the backup, it is impossible to synchronize the MQ backup with the backups of all the client applications. If the queue manager is ever restored from that backup, the impact to all those applications is unpredictable at best.

  • Eliminating soft limits: In the course of developing a new application, it is quite common to bump into soft limits such as MAXDEPTH or MAXMSGL. The usual response is to raise these values to eliminate the "problem." Because there is no need to add code and complexity to deal with these limitations, development can proceed much faster. But this approach treats the soft limits as a nuisance to be eliminated rather than the useful tool they are intended to be. An application hitting one of these limits might be impacted, but at least it has the opportunity to respond sensibly to the problem (see the sketch after this list). Remove the limits and the queue manager is at much greater risk of exhausting its physical resources. When those hard limits are reached, the entire queue manager and all connected applications come to a halt. This is much worse than the temporary and isolated impact to a single application that prompted the change in the first place. This is a case where the written best practice documentation usually recommends the right thing but the community overwhelmingly ignores the advice.

  • Cluster channels named TO.<QMGR>: This practice works only in the limited case where there are no overlapping clusters. Now that clusters have become mainstream, overlapping clusters are becoming quite common. This naming convention ensures that the CLUSRCVR channel will be shared across all clusters in which a queue manager participates. In this configuration, maintenance in one cluster necessarily impacts the operation of any others, so this practice is one I consider to be a classic anti-pattern. A better approach is to use names like <CLUSTER>.<QMGR>, which ensures dedicated channels for each cluster. However, the product manuals still document the TO.<QMGR> convention, and it is therefore likely to remain widely practiced for some time to come.

  • Authentication of remote connections: There are a number of common practices related to authorization of client connections and channels from other queue managers. A typical example is the oft-repeated advice that the solution for authorization errors from WebSphere MQ Explorer is to place the user's ID into a local group that is sufficiently authorized. The problem with this is that it assumes the ID presented has been authenticated in some way. In fact, WebSphere MQ does not perform any authentication whatsoever. Authentication is delegated to the operating system for bindings mode connections or to a channel security exit for remote connections. SSL might be used to authenticate the channel connection, but the identity obtained is not propagated to the API layer unless an exit is present. Because WebSphere MQ authentication is so misunderstood, the prevailing security practices almost universally focus on authorization (setmqaut commands) and ignore authentication. You can run all the setmqaut commands you want, but without authentication the only people bound by them are the honest people you don't need to worry about. Anyone with malicious intent and access to the network will have no problem bypassing whatever authorization is in place if authentication is ignored.
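
On the soft limits point, the sensible response mentioned above might look like the following sketch using the WebSphere MQ base Java classes. The class name, retry policy, and backoff values are hypothetical; the point is that the application treats a full queue as a signal to back off rather than a reason to raise MAXDEPTH:

    import com.ibm.mq.MQException;
    import com.ibm.mq.MQMessage;
    import com.ibm.mq.MQPutMessageOptions;
    import com.ibm.mq.MQQueue;

    public class BackoffPut {

        private static final int MQRC_Q_FULL = 2053;  // reason code: queue at MAXDEPTH

        // The queue object is assumed to be already opened for output by the caller.
        public static void putWithBackoff(MQQueue queue, MQMessage message)
                throws MQException, InterruptedException {
            MQPutMessageOptions pmo = new MQPutMessageOptions();
            for (int attempt = 1; attempt <= 5; attempt++) {
                try {
                    queue.put(message, pmo);
                    return;
                } catch (MQException mqe) {
                    if (mqe.reasonCode != MQRC_Q_FULL) {
                        throw mqe;          // some other failure: surface it
                    }
                    // The queue is at MAXDEPTH. The consumer may simply be slow,
                    // so pause and retry instead of asking for the limit to be raised.
                    Thread.sleep(1000L * attempt);
                }
            }
            // Still full after retrying: degrade gracefully (alert, park the work, etc.)
            throw new IllegalStateException("Destination queue remained full; work deferred");
        }
    }

The same reasoning applies to MAXMSGL: segment or reject the oversized message in the application rather than raising the limit for every application on the queue manager.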

This list could go on but I do not want to get lost in the examples. We will have to save those for other articles. My point is that we as a community could do a lot better about embracing cultural change. We should reexamine our practices from time to time and revalidate the underlying assumptions. Then, if we find that a practice no longer models the real world, it should be updated.


Looking to the future

Given how difficult it is to change an established practice, the leverage is in catching the errors up front. If we improve at distinguishing differences in kind from differences in degree when new use cases come along, and at adapting to the truly different use cases, far fewer members of our community will experience sudden, unforeseen, and sometimes catastrophic outages despite having followed all the "best" practices. It is an appropriate time to tackle this issue because forces such as SOA, virtualization, consolidation, and regulation are driving architectural changes. The recent release of WebSphere MQ V7.0 included the biggest change to the product API since the initial release of MQSeries. In addition, new products are being layered over WebSphere MQ, such as the HTTP bridge, WebSphere MQ File Transfer Edition, and a recent update of WebSphere MQ Extended Security Edition. Perhaps as we integrate these new technologies we can focus not so much on how similar they are to our existing practices, but rather on how they differ. I will be happy to seed the discussion with a few topics.

Service orientation

The legacy of WebSphere MQ is largely based on point-to-point connectivity, a top-level namespace that resides in the network itself, and line-of-business ownership of host and queue manager assets. SOA breaks all of these assumptions. The SOA connectivity model is any-to-any. This drives WebSphere MQ away from point-to-point and toward a clustered configuration. A common idiom in the existing best practices is to use the cluster as the top-level namespace and to configure multiple separate clusters to provide namespace isolation and routing. But in an SOA context, the top-level namespace is in a registry above the messaging network layer. The more closely the MQ topology models the registry namespace, the more transparent and frictionless name resolution becomes as it moves vertically through the layers. Thus, SOA drives WebSphere MQ toward a single clustered namespace modeled after the service registry.

SOA also treats the queues and topics in the cluster as destinations in their own right. The queue manager becomes nothing more than a container that provides life support for destinations. So while the object names are migrating up into the logical layer, the queue manager, channels, processes, and other system names are being driven down closer to the physical infrastructure layer. The result is that it no longer makes sense to name queue managers in a business context; for example, by application name. When the queue manager becomes shared infrastructure, it makes more sense to name it in that context. Naming the queue manager after the host name will help us model the network and locate assets as we drill down from top level logical names into the network. This reverses a trend of moving away from naming queue managers after hosts, which was a common practice a decade ago.

SOA also changes the relationship of queues. The practice of embedding the sending and receiving qualifiers in the queue name is widespread today. In a point-to-point context, this naming convention helped to document the flow of messages in the network. But in an SOA context, it breaks the service-provider/service-consumer model. In an SOA implementation, the only well-known queue is the one that represents the service endpoint. This queue is named for the service that it represents and not the application providing the service. This ties the queue name to the service registry and the logical application layer. The service consumer needs only a reply-to queue, which can be an anonymous dynamic queue or a predefined static queue. If the reply-to queue is static, it is most likely named for the application making the service request. Services can request other services and there is a tendency to reuse the service HLQ (high-level qualifier) when creating reply-to queue names. This should be avoided because it potentially conflicts with or pollutes the namespace in the service registry. Any application that both provides and consumes services should use a separate HLQ for reply-to queues.
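
A minimal sketch of the consumer side of that model follows, using JMS. The service queue name is hypothetical and the connection is assumed to come from a JNDI-administered factory; the only well-known name in play is the service endpoint, with an anonymous dynamic queue carrying the reply:

    import javax.jms.*;

    public class ServiceConsumer {

        // "ACCOUNT.BALANCE.V1" is a hypothetical service endpoint name.
        public String request(Connection conn, String body) throws JMSException {
            conn.start();  // delivery must be started before the receive below
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);

            Queue serviceQueue = session.createQueue("ACCOUNT.BALANCE.V1");
            TemporaryQueue replyQueue = session.createTemporaryQueue(); // anonymous dynamic queue

            TextMessage request = session.createTextMessage(body);
            request.setJMSReplyTo(replyQueue);       // the service provider replies here
            session.createProducer(serviceQueue).send(request);

            // Correlate on the request's message ID so only the matching reply is consumed.
            String selector = "JMSCorrelationID = '" + request.getJMSMessageID() + "'";
            MessageConsumer consumer = session.createConsumer(replyQueue, selector);
            Message reply = consumer.receive(30000); // wait up to 30 seconds

            consumer.close();                        // close before deleting the temp queue
            replyQueue.delete();
            return (reply instanceof TextMessage) ? ((TextMessage) reply).getText() : null;
        }
    }

Nothing in the consumer's configuration names the providing application, which is the point: the binding is to the service, not to the implementation behind it.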

In short, application of point-to-point naming conventions in an SOA context tends to result in a point-to-point connectivity model. Although the cluster provides any-to-any connectivity, the names preserve the old style of connectivity but push it up into the logical layer.

Topic-level security

In the new version of WebSphere MQ, topics are first-order objects on par with queues. The setmqaut command is used to grant and revoke authorizations just as it always has for queues, but with a few new options. On the surface, this looks like a difference in degree.

But topics are fundamentally different than queues. The topic name can be extremely long and composed of many arbitrarily long nodes. The setmqaut command, however, works on standard 48-character object names. To create authorizations that are meaningful in the topic namespace, a topic object with a 48-character name is mapped to a specific point in the topic hierarchy. The setmqaut command then grants or revokes authority on the object definition. In order to efficiently resolve authorization requests on topic nodes for which no object definition exists, permissions are inherited down the topic tree from parent to child nodes.
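
As a sketch of how that mapping works, assuming a hypothetical queue manager QM1, topic string /sports/results, and local group scoreapps, an administrative topic object pins a point in the topic tree:

    * Map a 48-character object name to a node in the topic tree
    DEFINE TOPIC(SPORTS.RESULTS) TOPICSTR('/sports/results')

The authorization then attaches to the object and flows down the tree:

    # Grant subscribe on /sports/results and, by inheritance, everything below it
    setmqaut -m QM1 -n SPORTS.RESULTS -t topic -g scoreapps +sub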

The best practices for authorization of topics have yet to emerge. I do not know yet what they will look like, but I do know that if we treat them as nothing more than extensions of the queue authorization model, the "best" practice will be wrong.

Network topology

In electrical engineering terms, a bus is a shared path by which electrical components can exchange signals in an any-to-any fashion and using a common protocol. The closest analogy in WebSphere MQ terms would be the any-to-any connectivity of a cluster combined with a common message format, such as EDI or SWIFT. But in the IT world, a bus is increasingly understood to mean a central component that provides common services, such as mediation, routing or translation. As a result, the bus concept is driving the adoption of hub-and-spoke topologies at both the physical and logical network layers.

This is significant to WebSphere MQ because authentication of remote connections in MQ occurs at the link. In a point-to-point network, the link-level authentication could be granular. Each node typically hosted no more than one or two related applications so authorizing a link was roughly equivalent to authorizing the application residing there. Compromise of a single node placed a few adjacent nodes at risk.

The hub-and-spoke topology breaks this security model. Every spoke node must be authorized to place messages onto the hub. Similarly, every spoke node must be authorized to receive messages from the hub. The vulnerability here is that a spoke node can address messages not to the input queue at the hub, but rather to the output queue at the hub. Using the hub in this way enables any spoke node to access any destination in the entire network.

The mitigation is to create a separate identity for each spoke node, place that account in the MCAUSER of the inbound channel, and authorize it only to specific service endpoints. The problem with this is that it forces us to embed the authorization policy into the physical network. In addition to being very difficult to manage, it does not fit well within the SOA model, in which authorization policy is managed centrally and independently of the underlying transport.
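
Concretely, the mitigation might look like the following sketch, with hypothetical channel, account, and queue names; depending on the channel's PUTAUT setting, additional context authorities may also be required:

    * At the hub: run the spoke's receiver channel under a dedicated, low-privilege ID
    ALTER CHANNEL(HUB.SPOKE01) CHLTYPE(RCVR) MCAUSER('spoke01')

    # Authorize that ID to put only to the service endpoint this spoke legitimately needs
    setmqaut -m HUB -n ORDER.SERVICE.REQUEST -t queue -p spoke01 +put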

Setting that aside for a moment, the other problem with this model is that compromise of the hub exposes all of the business assets in the messaging network. It does not necessarily mean that administrative access is possible on any node, but all legitimate destination objects authorized to the hub are vulnerable. This suggests a tiered security model where the spoke nodes are the baseline and the hub is hardened, similar to a gateway.

Ultimately though, these techniques treat the hub-and-spoke topology as a difference in degree when it is actually a difference in kind. Both the topology and the service oriented architecture that drives it are fundamentally different than the point-to-point constructions of the past. They are driving authentication up the stack from the link to the message itself. If we fail to recognize this as a fundamentally different use case and instead apply the old security model to it, the result will be a system which is wide open but perceived to be highly secure. This is worse than no security at all.

IBM's SOA Security Expert Dr. Raj Nagaratnam explained in a recent interview that services are based on a trust model where authorization is delegated to a policy layer external to the application. Indeed, we can no longer assume that the application itself is a single component. It may be composed from several interoperating services. If authorization is to function effectively and efficiently in such a composite application, the identity must be tied to the individual transaction rather than the pipes through which the transactions flow. As SOA matures, message-level authentication technologies such as WebSphere MQ Extended Security Edition will become strategic components enabling the new security model.


Summary

Although I've mentioned some specific examples, I am not suggesting that we need to immediately rename all of our queue managers, rebuild our network topologies, or recode all of our client applications. The category of problem I have described here persists specifically because it tends to stay dormant in the majority of cases, so most of the community is not affected.

But we have been reusing the same best practices for so long, and the underlying model has drifted so far, that the difference represents significant potential risk. The wider this gap is, the more of us are impacted. Extrapolate this process out long enough and the chance of experiencing one of these problems approaches 100%.

What I see repeated over and over on my consulting assignments are cases where customers suffered a major outage despite having diligently applied all the best practices. The examples above were all real-world incidents. It doesn't happen often, but I have seen several occasions where a restore of a queue manager failed because the backup set was unusable. Similarly, most of the cluster outages I have worked on involved overlapping clusters that shared channels named TO.<QMGR>. When it comes to security, the prevailing practices completely ignore authentication; the result is that close to 95% of the shops assessed exposed anonymous administrative access.

With all of the change occurring now, the community has an opportunity to close the gap and bring our best practices in line with the use cases currently employed. This need not be expensive, but it will require greater participation, an active dialog within the community, and a willingness to question some of our long-standing traditions. If we examine and refine our best practices in the online forums, we can begin to integrate them as we consolidate and virtualize our data centers, migrate to SOA, and deploy all the new versions and new products. More importantly, we can embrace a cultural change that values adaptability and flexibility in our knowledge management, just as we value these attributes in the systems we design. Agility is not a superficial trait. It runs deep or not at all.
