z/OS WebSphere MQ: What Goes Up Must Come Down
MarkWomack
I thought it'd be a good idea to take a look at problems with starting up WebSphere MQ and shutting it down. These types of problems usually have totally unrelated causes and so are resolved in totally different ways. Let's start with how to stop a queue manager.
There are three modes available to stop MQ. The default is QUIESCE; the QMGR can also be stopped with MODE(FORCE) or MODE(RESTART). QUIESCE mode allows all programs currently running to finish processing before the QMGR stops. No new programs can start, and any connections that MQ has to other address spaces have to end beforehand. If any such connections need to be broken, z/OS commands can be used to accomplish this.
The more powerful MODE(FORCE) doesn't wait for programs to finish processing but terminates them (including utilities) right away, which can leave some work in-doubt. This is an option during troublesome conditions (for instance when the active logs are full, archiving is not working, or in cases where the z/OS CANCEL command might otherwise be used).
Lastly, the RESTART flavor terminates the QMGR much in the same way that MODE(FORCE) would except the QMGR is *not* deregistered from the ARM (Automatic Restart Manager) component of z/OS. This means the MQ subsystem can be restarted in place if abending or can be restarted on another z/OS image if z/OS is having a problem.
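As a quick reference, the three flavors described above look like this when entered at the console. This is only a sketch: the command prefix +CSQ1 is a made-up example, so substitute the prefix defined for your own queue manager.

```
+CSQ1 STOP QMGR MODE(QUIESCE)
+CSQ1 STOP QMGR MODE(FORCE)
+CSQ1 STOP QMGR MODE(RESTART)
```

Since QUIESCE is the default, a plain STOP QMGR with no MODE keyword behaves the same as the first command.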
In most instances some flavor of the STOP command will end the queue manager's life, but for some types of the STOP command there are a few obscure points to make. For one, when a QMGR is stopped with MODE(FORCE) you can expect to see an ABEND6C6 if the queue manager had previously been subject to a STOP MODE(QUIESCE). The abend means the quiesce shutdown processing has been terminated by the force shutdown, and in this case no dump is produced. No reason code is presented with this abend type.
Use of a FORCE STOP can mean that a previous STOP QMGR MODE(QUIESCE) either was taking too long or was never going to complete at all. While there certainly have been defects that caused such shutdown hangs, we've also seen cases where the shutdown procedure itself has prevented the QMGR from terminating. When stopping the QMGR with MODE(QUIESCE), check to see if any panels (CSQOREXX) are still held open by anyone. If so, these will prevent the QMGR from terminating. When dumps arrive at Level 2, we immediately check to see if any BATCH connections are left open to the QMGR, and if so, we know these will prevent termination. In a recent case, a CICS transaction that had started some 41 hours previously prevented a STOP (mode QUIESCE) from ending while MQ waited for that transaction in CICS to end. So, if a QUIESCE has been used, it's important to note that MQ will wait for running programs to finish. If they don't complete their work, then MQ will not be able to complete its QUIESCE shutdown.
So how can you tell if an existing connection is preventing shutdown completion? Well, in Version 7.1.0, at queue manager termination MQ issues the following command so you can see what's holding things up: DISPLAY CONN(*) TYPE(CONN) ALL WHERE (APPLTYPE NE SYSTEMAL). This will return the existing connections to the queue manager. Note that SYSTEM connections are those with an APPLTYPE of SYSTEM or CHINIT. APPLTYPE NE SYSTEMAL differs from a display of APPLTYPE NE SYSTEM since SYSTEM only includes queue-manager threads, whereas SYSTEMAL(L) includes queue-manager and channel initiator threads. Thus APPLTYPE NE SYSTEMAL will return threads which are NOT from the queue manager or channel initiator. It's those other returned threads which could prevent shutdown.
For lower levels of the product, an operator could manually issue a slightly different flavor of this command, DISPLAY CONN(*) TYPE(CONN) ALL WHERE (APPLTYPE NE SYSTEM), to determine more about threads still in existence at the time of shutdown.
The dynamics for starting WebSphere MQ are quite different, however. Successful QMGR startup depends on having the right pieces in place. This includes setting up the BSDS so that the log inventory can be kept. A BSDS that has become corrupted can lead to READ or WRITE errors that will prevent MQ from functioning properly. The logs work hand-in-hand with the BSDS, so their content should agree with the information the BSDS holds. In cases where a discrepancy exists, MQ generates appropriate error messages and provides a change log inventory utility so that you can align the BSDS contents with the actual log inventory. This may not be the only anomaly that can disrupt QMGR startup. Procedural missteps can cause issues such as unequal timestamps in dual BSDS data sets, or problems where the BSDS can't be opened at all. The information center documents the problems that we expect the BSDS could encounter and how to fix those issues. Searching on "BSDS Problems" in the information center takes the reader directly to those symptoms with linked solutions.
It is highly recommended that you take advantage of dual logging, dual BSDS data sets, and archiving. Recoverability of the QMGR and its data is improved manyfold when these kinds of failsafes are used. If you still run into problems, searching the information center for "Active Log Problems" or "Archive Log Problems" will take you directly to the bulleted solutions you'll need in order to get past startup problems caused by these data sets.
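For illustration, dual logging, dual BSDS and archiving are requested through the CSQ6LOGP macro when the system parameter module is built. The fragment below is a minimal sketch showing only the four relevant keywords; a real CSQ6LOGP invocation in your parameter module source will carry other keywords as well, and the data sets themselves still have to be defined separately.

```
         CSQ6LOGP OFFLOAD=YES,TWOACTV=YES,TWOARCH=YES,TWOBSDS=YES
```

Here OFFLOAD=YES turns archiving on, TWOACTV and TWOARCH ask for dual active and dual archive logs, and TWOBSDS asks for dual BSDS data sets.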
There are also some odds and ends, rarely seen though, which can prevent a queue manager from successfully starting. Generally, it's always good to rebuild the system parameter module for the QMGR every time you migrate to a new level of QMGR code. We know that, sometimes, an old parameter module can still be used when you move to a new code level; however, because the level information is internalized, users have no way to know which old module level will prevent the queue manager from starting if the parameter module is not rebuilt.
I recently had a case where the client had upgraded several queue managers (that had been running at varying old levels of MQ) to the most current 7.1 code. Of the 10 QMGRs that had been migrated (none of which had had its parameter module rebuilt), one failed to start. That QMGR continually generated the message CSQY019E, indicating the parameter module was at an invalid level. Since 9 QMGRs started fine and 1 did not, and all had been migrated to the same 7.1 level, the client immediately thought that a defect must be at hand. It took some digging to find out that a boundary had been crossed: the old parameter module had originally been built at the very old MQSeries 5.2 level, and that build had an internal level which version 7.1 code flagged as invalid. All of the other queue managers were using parameter modules that had been built at Version 6 (and modules built at that level pass the 7.1 test for currency). Needless to say, the client was glad once that one was figured out.
The moral of the story is to always rebuild the parameter module, ensuring that it is recompiled with the same level of code that the queue manager is running with. This happens to be Task 17 in the z/OS System Setup Guide.
Command prefixes fall into the odds and ends category too. Command prefixes tell the operating system which subsystem to route a command to, but it does happen that the command prefix is unknown to z/OS. When this is the case, it becomes impossible to start the QMGR since z/OS has no idea where the START QMGR command should be routed. The Migrating WebSphere MQ portion of the information center gives direction on the requirements for SYS1.PARMLIB in order to properly define the MQ subsystem, as does Task 12 in the Setup Guide.
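To make that concrete, the subsystem and its command prefix are defined with an entry in the IEFSSNxx member of SYS1.PARMLIB. The sketch below assumes a hypothetical subsystem name of CSQ1, a hypothetical command prefix of +CSQ1, and system scope (M); your own names and scope will differ.

```
SUBSYS SUBNAME(CSQ1) INITRTN(CSQ3INI) INITPARM('CSQ3EPX,+CSQ1,M')
```

With a definition like this in place, z/OS knows to route a command such as +CSQ1 START QMGR to the CSQ1 subsystem.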
One other startup failure caused by configuration relates to the setting for MQ's use of 64-bit (or above the bar) storage. MEMLIMIT set to 2 GB on the QMGR started proc (and a REGION of 0M) most often works just fine. MQ will try to store existing control blocks into this above the bar storage, but if our 64-bit allocations are not allowed then the QMGR may fail to start. Sometimes the indications that MEMLIMIT needs to be set are unclear, and the diagnostics produced require much review in order to piece the root cause together. Given this, it's always best to either set up MEMLIMIT within the started task, or alternatively make sure that the limits provided by SMFPRMxx (in SYS1.PARMLIB) give the queue manager enough space to make its allocations. The IEFUSI exit can also be used to provide a default limit for jobs using virtual storage above the bar.
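As a rough example of the first option, MEMLIMIT and REGION can be coded on the EXEC statement of the queue manager started task procedure. The step name PROCSTEP below is only illustrative, and the rest of the procedure (STEPLIB, BSDS DD statements and so on) is omitted.

```
//PROCSTEP EXEC PGM=CSQYASCP,REGION=0M,MEMLIMIT=2G
```

Coding the value here means the queue manager does not depend on whatever default SMFPRMxx or an IEFUSI exit would otherwise supply.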
For now that's about all I can think of that could be keeping the queue manager from holding its CHIN high (groan). With a bit of review of the information center and performance of the setup tasks, the queue manager should start and shut down as expected and provide years of productive message queueing.