One of the most common questions asked regarding WebSphere MQ Clusters is "How big is a big cluster?". I thought it would be useful to put together a post side-stepping this question once and for all (at great length - but for those who make it to the end, there may be a sort of answer as a reward).
There are all sorts of factors which will affect how a cluster scales, and before we can begin to tackle what makes a cluster 'big' we need to think about each in turn.
Full Repositories
Much has been said elsewhere about choosing Full Repositories (FRs) for a cluster, and we won't repeat all of that here, but a few basic best practices which particularly affect cluster scalability are worth restating:
- There should be exactly two FRs for every cluster (probably a topic for a separate post another day)
- They need to be able to connect to every queue manager in the cluster, ideally simultaneously, so must at least be able to support that many channel pairs (see 'Channel considerations' below).
- Avoid using the same FRs for multiple clusters
- For large or busy clusters, avoid hosting application queues on the FRs.
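By way of illustration, here is a minimal MQSC sketch of how the first of those practices is typically put in place - every name in it (the cluster DEMO, the queue managers FR1 and FR2, the hosts and ports) is hypothetical:
```
* On FR1: make this queue manager a full repository for the cluster,
* define its cluster-receiver channel, and point a cluster-sender at
* the other full repository.
ALTER QMGR REPOS(DEMO)
DEFINE CHANNEL(TO.FR1) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) +
       CONNAME('fr1.example.com(1414)') CLUSTER(DEMO)
DEFINE CHANNEL(TO.FR2) CHLTYPE(CLUSSDR) TRPTYPE(TCP) +
       CONNAME('fr2.example.com(1414)') CLUSTER(DEMO)
* The mirror-image definitions are then made on FR2.
```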
System resources
The same considerations apply to all queue managers to a degree, but by far the highest overhead from clustering falls on the Full Repositories (which need to persist a complete record of every object in the cluster).
For both Full and Partial Repository caches there are implications for memory, CPU and disk usage on these queue managers. To give a very broad idea, expect memory/disk usage of at least 1kB for every clustered object (queue, queue manager, or topic) discovered.
There will then also be an overhead of around 0.5kB for every 'subscription' to those objects (created when a given queue manager notifies the cluster that it is making use of a particular cluster resource). As you can see, this subscription information means that where resources are actually accessed in the cluster - where applications connect - makes a big difference to the load on the Full Repositories in coordinating it all. Grouping applications that access the same queues together means there is less of this state to maintain, and avoiding overlapping clusters has a dramatic positive effect (see the Topology complexity section later).
I don't recommend trying to be precise with these measurements, but if a quick estimate based on the above comes out in the region of 1GB for a given environment, it might be considered 'very large'. There is an upper limit of 2GB for the cluster cache, but the practical limit is typically lower, depending on platform, version of WMQ and the like. CPU usage, particularly at the hourly cluster maintenance intervals, will also increase with larger numbers of objects to maintain.
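If you want a rough feel for how much cluster state a particular queue manager is holding, a few standard MQSC displays give an idea of how many records it knows about and where the cache is persisted - a sketch (the queue depth is only a rough indicator, not an exact measure of memory use):
```
* How many queue managers and clustered queues does this
* queue manager currently know about?
DISPLAY CLUSQMGR(*)
DISPLAY QCLUSTER(*)
* The repository cache is hardened to this queue; its depth gives a
* feel for the number of records being persisted.
DISPLAY QLOCAL(SYSTEM.CLUSTER.REPOSITORY.QUEUE) CURDEPTH
```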
Channel considerations
How many active channels a queue manager can support will vary by platform, version, hardware and so on - some information on the overheads per channel instance can be found in the published performance report SupportPacs.
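A quick way to get a feel for how close a queue manager is running to its channel capacity is to count the channel instances currently active, for example in MQSC:
```
* How many channel instances are currently running?
DISPLAY CHSTATUS(*) WHERE(STATUS EQ RUNNING)
```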
As mentioned above, there are situations in which the Full Repositories will need to contact every queue manager in the cluster to share information - for example when object definitions change, or when processing a REFRESH (say, because a queue manager has had to be restored from backup). If not all of those channels can start simultaneously this is not necessarily a disaster - the information can be forwarded once other channels have stopped and freed up resources. However, for smooth functioning of the cluster it is better to be sure that all cluster state is being shared in a timely manner.
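For reference, the REFRESH in question is the MQSC command sketched below (DEMO is a made-up cluster name, and this is a heavyweight operation to be used sparingly rather than as a routine tuning step):
```
* Discard the locally cached cluster information and rebuild it from
* the full repositories. REPOS(YES) also discards information about
* the full repositories themselves and cannot be used on an FR.
REFRESH CLUSTER(DEMO) REPOS(YES)
```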
For other queue managers, channel requirements will vary dramatically depending on how your applications and queues are distributed and configured (see also the next section). Publish/subscribe clusters have particularly high inter-queue-manager communication requirements - it is best to assume that every queue manager in a pub/sub cluster will need to talk to every other queue manager - so typically these will not be able to scale to the same degree.
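That cost follows from the fact that clustering a topic is deliberately easy: a single attribute on the topic definition advertises it throughout the cluster. A sketch, with made-up names:
```
* One attribute makes the topic visible across the whole cluster, but
* it also means every queue manager in the cluster learns about (and
* may need to connect to) every other one.
DEFINE TOPIC(SPORTS.RESULTS) TOPICSTR('/sports/results') CLUSTER(DEMO)
```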
Workload and application design
Clusters are very deliberately designed so that most information is only shared on a 'need to know' basis with partial repositories. This means that clusters can scale quite well to large numbers (thousands) of queue managers where actual communication between individual queue managers is quite 'sparse'. A good (and typical) example of this is a 'star' topology where a few heavy duty servers in a datacenter host the Full Repositories and certain back-end application queue managers, and a much larger number of smaller servers - often 'in-store' or 'branch' installations - connect back to this hub.
In this situation, most of the queue managers do not need to cache large amounts of cluster data, and need only run a few pairs of channels. To help further regulate this kind of configuration there are some specific workload balancing parameters - in particular CLWLMRUC - which can control how many servers a particular 'satellite' queue manager should use to service application requests.
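For example (the value of 2 here is purely illustrative), a satellite queue manager can be told to spread its traffic across only its two most recently used outbound cluster channels - a sketch:
```
* Limit cluster workload balancing on this queue manager to its two
* most recently used outbound cluster channels.
ALTER QMGR CLWLMRUC(2)
```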
Topology complexity
Placing queue managers in multiple clusters puts significant extra strain on each queue manager in the overlapping zone - the 'subscription' overhead discussed under 'System resources' above is multiplied by the number of overlapping clusters involved. In particular, it is best to avoid having one Full Repository host multiple large clusters, or defining large numbers of objects on 'gateway' queue managers which participate in multiple clusters.
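For completeness, overlapping membership is typically configured with a namelist, and every object shared this way has to be maintained in each of the clusters in that namelist - a sketch with hypothetical names:
```
* A gateway queue manager advertising a queue into two clusters at once
DEFINE NAMELIST(BOTH.CLUSTERS) NAMES(CLUSTER1, CLUSTER2)
DEFINE QLOCAL(GATEWAY.QUEUE) CLUSNL(BOTH.CLUSTERS)
* The queue is defined once, but cluster records and subscriptions for
* it are maintained separately in CLUSTER1 and CLUSTER2.
```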
So, if we keep all of the above in mind, can we come up with a very rough definition of a large cluster? The answer is no - there isn't one definition of 'large'. However, we can make some statements about what might be considered large for a particular 'type' of cluster. Please bear in mind that these are only very approximate guidelines and may vary massively in light of any of the factors above, the hardware environment and so on. Some deployments running today will certainly have larger numbers, configured in such a way that they are not seeing any issues.
With those provisos, a 'big WMQ Cluster' might range from:
- A publish/subscribe cluster where applications can come and go as they please, dynamically creating topics which are used for many-to-many communication: 5 queue managers
- 50 overlapping clusters (maybe to provide dedicated application channels) involving the same small set of queue managers and a few hundred queues in total: 10 queue managers
- A tightly managed publish/subscribe cluster with a few administratively controlled topics and carefully sized systems which can support all required channel activity: 100 queue managers
- A point-to-point (queued only), many-to-many mesh in a single cluster: 500 queue managers
- A point-to-point, carefully managed star topology, with high-powered FRs and concentrated application servers: 3000 queue managers
As a final note, however much planning and preparation is done in advance, maintaining a large cluster deployment and continuing to scale outwards is also going to involve continuous monitoring of the system over time to identify bottlenecks and potential problems before they occur. While there are many tools and pieces of documentation out there to help with this, monitoring the cluster repository processes in particular may need to be the subject of a further post: watch this space…
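In the meantime, one simple health check is worth knowing about: the repository manager works through SYSTEM.CLUSTER.COMMAND.QUEUE, so a depth on that queue which keeps growing is an early sign that it is struggling to keep up with cluster updates. In MQSC:
```
* A persistently growing depth here suggests the repository manager
* is falling behind on incoming cluster updates.
DISPLAY QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) CURDEPTH
```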