The answer to that question is "well, yes pretty much". If a WebSphere MQ channel fails and you're betting on root cause you can likely win that bet if you know that most channels fail because of an underlying problem either within the network, or the general network configuration. While IBM Support is always ready to provide ancillary support on such problems, often the most time is saved if network personnel are involved immediately and can get started on review of network type traces. The tools of the trade include z/OS packet traces, network sniffer traces, and documentation like that provided through the tcpdump facility. While messages not moving may seem like something caused by MQ, our application simply relies on the same transport methods that apps like FTP, SMTP and Telnet use. If the underlying network is unstable then no application will be able to pass any network traffic.
So what kinds of symptoms might represent a problem that's really based in the network?
Well, assume a transmission queue backs up. If you see the CURDEPTH start to grow, issue a DIS CHSTATUS(channelname) ALL. Is the channel INDOUBT? The sending side of a channel will be INDOUBT when it has already sent a BATCH of messages but has not yet received an acknowledgement from the other end that the batch was successfully received. (RECEIVING channels can never be INDOUBT.) Since an INDOUBT status means the sender has sent the messages, the question becomes why the remote end did not acknowledge them. Either the full batch of messages never made it through the network to the receiving side, or, if it did, either (a) the receiver never sent an acknowledgement, or (b) an acknowledgement was sent but *it* never made it back to the sending side. The quickest diagnostic approach here is to run a packet capture on the sending side (if the sender is on the z/OS platform then a z/OS packet trace would be run), and it's crucial that a simultaneous packet capture be run on the receiving side. These traces have to be run during the same test window; otherwise it is not possible to match sequence numbers or acknowledgements at the packet level. As well, these simultaneous traces can be used to see if either of the stacks has entered retransmit mode. Retransmit mode is a clear indication of a network anomaly where the endpoints (or middle hops) are either failing to pass packets at all, or discarding them in favor of other traffic until a later time. With this kind of tracing, network specialists can pinpoint the troubled hop most quickly.
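To make the idea of matching the two traces a little more concrete, here is a rough Python sketch. It assumes each capture has already been reduced to a list of (sequence number, payload length) pairs for one direction of one connection; the function name and the sample numbers are made up for illustration and are not output from any real trace tool.

```python
from collections import Counter

def compare_captures(sender_segments, receiver_segments):
    """Compare two simultaneous captures of the same connection.

    Each capture is a list of (sequence_number, payload_length) tuples
    for ONE direction of traffic (hypothetical, pre-extracted format).
    """
    sent_counts = Counter(sender_segments)
    received = set(receiver_segments)
    # Segments that left the sender but never showed up at the receiver.
    missing = sorted(seg for seg in sent_counts if seg not in received)
    # Segments the sender transmitted more than once (retransmit mode).
    retransmitted = sorted(seg for seg, n in sent_counts.items() if n > 1)
    return missing, retransmitted

# Illustrative data: the third segment is resent twice and never arrives.
sender = [(1000, 500), (1500, 500), (2000, 500), (2000, 500), (2000, 500)]
receiver = [(1000, 500), (1500, 500)]
missing, retransmitted = compare_captures(sender, receiver)
print("never arrived:", missing)
print("retransmitted:", retransmitted)
```

A segment that shows up in both lists is the interesting one: the sender kept retrying it and the receiver never saw it, which points at a hop in between rather than at either endpoint.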
So, why wouldn't a batch of messages be able to get through the network? There are many reasons. The first task is to find out which hop won't allow passage. Once found, the reason could be something as simple as path MTU discovery. Abbreviated as PMTUD, this standard from the 90s is used to determine the largest packet that can pass through all network hops at the time that the socket connection is first established. For a z/OS stack, use of the standard is indicated by coding PATHMTUDISCOVERY within the TCP/IP profile. In order for path MTU discovery to be used, the other endpoint has to agree to it as well by having a corresponding statement coded in its profile (or stanza), depending on the platform. Once done, discovery packets are sent with the don't fragment (DF) bit turned on so that routers cannot fragment them. Thus, if a packet is too large to pass through a certain router hop, the packet will be dropped and, if ICMP is configured, an ICMP Type 3, Code 4 message will be sent back to the originator indicating what MTU size the hop can pass. The originating stack will then send its next discovery packet using this smaller size. This discovery procedure continues until the packets that are sent can pass through all hops without being dropped. Some caveats to PMTUD are these:
- Only IPv4 routers perform fragmentation, while IPv6 routers leave fragmentation up to the endpoints. If an IPv6 router receives a packet larger than the MTU it can forward, it simply drops it. The router behavior described above therefore applies to IPv4 networks; in an IPv6 network it is the endpoints that have to discover and honor the path MTU.
- ICMP packets (at least the type used by PMTUD) must be allowed in a network in order for PMTUD to work. In some networks the protocol is blocked and so these vital ICMP packets are never seen. They are a great clue to network folks as to why a transmit queue may be backing up (but if network security prohibits their flow, then this smoking gun will never be seen).
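The discovery loop and the ICMP caveat can be sketched as a small simulation. This is only a toy model under my own assumptions (a path is just a list of hop MTUs, and the function name is invented); it is not how any particular stack implements PMTUD.

```python
def discover_path_mtu(initial_size, hop_mtus, icmp_allowed=True):
    """Simulate PMTUD: shrink the packet until every hop can pass it.

    Returns (discovered_size, sizes_tried). discovered_size is None when
    a packet is dropped and the ICMP Type 3, Code 4 reply is blocked.
    """
    size = initial_size
    sizes_tried = []
    while True:
        sizes_tried.append(size)
        # First hop along the path whose MTU is too small for this packet.
        blocking_mtu = next((mtu for mtu in hop_mtus if mtu < size), None)
        if blocking_mtu is None:
            return size, sizes_tried      # packet reached the far end
        if not icmp_allowed:
            return None, sizes_tried      # dropped silently; sender never learns why
        size = blocking_mtu               # retry with the MTU the hop reported

print(discover_path_mtu(1500, [1500, 1400, 576, 1500]))
print(discover_path_mtu(1500, [1500, 1400, 576, 1500], icmp_allowed=False))
```

With ICMP flowing, the sender steps down from 1500 to 1400 to 576 and settles there. With ICMP blocked, the very first oversized packet disappears and the sender has nothing to act on, which is exactly the backed-up transmit queue symptom.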
By the way, a great test to determine if transmit queues are backing up because of packets that are too large is to set the MTU to 576. Packets of 576 should *always* pass within a TCP/IP network because this is the minimum sized datagram that all hosts must be prepared to accept.
OK, so suppose your channels which pass messages in the clear work just fine; but when you enable SSL on them, they don't? Again, it's not likely to be a problem within WebSphere MQ. Dropping the datagram size down again could be a good diagnostic test; but here's what I've seen. SSL connections may be trying to pass larger payloads (that includes SSL headers and authentication codes that can pad the end of a flow). At channel start time certificate exchanges are common. Of course, they need not occur at all if not required; but when certificates do flow there is a chance they may be larger than what any particular hop can pass. In one case I had been sent a trace of the endpoint starting a secure channel, and a simultaneous tcpdump of the SSL server side. What I saw was astounding. All of the certificate data (1480 bytes) was in the trace on the endpoint starting the channel; but when I looked at the same time slice on the remote trace, I saw no data at the same time. Then, I went back to the starting side's trace and looked at the next SSL flow which was the next 576 bytes of certificate data. I returned to the remote trace and looked at that same time slice. Wouldn't you know it, this time, all 576 of those next bytes were there! This made it clear that there was a cap on how much data could pass through some hop in the network. The task for the network specialists now was to continue to run traces of this SSL initialization further and further away from the receiving side until they got to the hop where they could still see that initial flow of 1480 bytes. Once they found that hop through tracing they knew which hop could not pass the data without fragmenting it. In a case like this, channel initialization for WebSphere MQ can simply hang and that will be the only symptom.
So, even if WebSphere MQ isn't the root cause of most channel problems, what can you do in MQ in order to mitigate these issues until your network folks can find that smoking gun? I like to start with those key channel timers like heartbeat, keepalive, and disconnect intervals. Heartbeat intervals that are too small can totally prevent a channel from starting at all. What's too small? Well, it depends on how long it takes for the remote side to reply to the INITIAL DATA flow. When a channel is started, a *pre-negotiated* HBINT (based on the sending side's channel definition) is used to determine how long a SELECT call will wait before it times out the connection to the remote side. If the remote side can't send back its INITIAL DATA before that time elapses then MQ will send a CLOSE call to TCP/IP asking it to terminate the socket. This will flow a FIN packet to the remote side, killing the connection and preventing channel startup. Since this results in the socket being only half closed, if the remote side later sends its INITIAL DATA then a RESET will be returned to it. So, the HBINT on the sending side should be set large enough to keep the initiating side from closing its half of the socket too quickly. The *pre-negotiated* heartbeat is twice the non-negotiated heartbeat on the channel definition, so an HBINT of 3 seconds will allow the remote side 6 seconds to respond to an INITIAL DATA channel startup flow. The final negotiated HBINT could be much larger. It's worth mentioning that heartbeat intervals which are too small can also carry a cost to the network, as they increase traffic through it.
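The startup arithmetic is simple enough to write down. The helper names below are mine, purely for illustration, and the strict "before the time elapses" comparison is my reading of the behavior described above.

```python
def startup_wait_seconds(hbint_seconds):
    """Pre-negotiated wait: twice the HBINT coded on the sending channel."""
    return 2 * hbint_seconds

def initial_data_arrives_in_time(hbint_seconds, remote_reply_seconds):
    """True if the remote INITIAL DATA reply beats the startup wait."""
    return remote_reply_seconds < startup_wait_seconds(hbint_seconds)

print(startup_wait_seconds(3))              # seconds the remote side gets with HBINT(3)
print(initial_data_arrives_in_time(3, 5))   # reply after 5 seconds
print(initial_data_arrives_in_time(3, 8))   # reply after 8 seconds
```

If the remote side routinely needs longer than twice your HBINT to answer (a slow link, a busy listener), that alone is enough to keep the channel from ever starting.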
I've heard it said that the heartbeat interval is MQ's version of the keepalive timer that the TCP/IP stack implements. There is some validity to that since the purpose of both timers is to detect idleness. Still, WebSphere MQ allows you to piggyback off of the stack keepalive timer by coding a KAINT value on a per channel basis. A stack's keepalive timer, described in RFC 1122, is meant to detect cases of (TCP/IP) network layer outages, while the heartbeat interval is meant to detect MQ level outages. Of course, a network outage could precipitate an MQ outage. As one type of outage may lead to another, it's a good thing that the default values of the channel timers end up being tiered. RFC 1122 requires that the keepalive interval default to no less than two hours and that it be configurable in stacks that implement keepalive. Imagine a socket failing and not being alerted about it for two hours. This certainly has happened to many. At an MQ application level (because 2 hours is so long, and changing it in the stack applies to every application) it may make more sense to configure KAINT on a per channel basis. If you don't configure KAINT, its associated interval is calculated based on the negotiated heartbeat. The negotiated heartbeat will often be 300 seconds (or 5 minutes). If KAINT is not manually configured it will have a value of AUTO, which represents the negotiated HBINT plus 60 seconds, or typically 6 minutes. If the negotiated HBINT ends up resolving to a value of 0, then it's important to know that the INTERVAL statement in the TCP profile configuration will be used. Beware that the stack's profile might still be defaulting to 2 hours per RFC 1122.
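Here is how I picture the KAINT resolution rules, as a sketch only. The function name and the 7200 second stack default are my assumptions for illustration, and any other special KAINT values your MQ level may support are deliberately not modeled.

```python
STACK_DEFAULT_INTERVAL = 7200  # assumed stack-wide keepalive: 2 hours, in seconds

def effective_keepalive_seconds(kaint, negotiated_hbint,
                                stack_interval=STACK_DEFAULT_INTERVAL):
    """Resolve the keepalive interval a channel ends up with.

    kaint is either the string "AUTO" or an explicit number of seconds.
    """
    if kaint != "AUTO":
        return int(kaint)                 # explicitly coded on the channel
    if negotiated_hbint > 0:
        return negotiated_hbint + 60      # AUTO: negotiated HBINT plus 60 seconds
    return stack_interval                 # HBINT of 0: fall back to the stack's INTERVAL

print(effective_keepalive_seconds("AUTO", 300))
print(effective_keepalive_seconds("AUTO", 0))
print(effective_keepalive_seconds(120, 300))
```

The middle case is the one that bites: a negotiated HBINT of 0 quietly hands the decision back to the stack, and that could mean two hours.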
The z/OS stack's implementation of keepalive works through just a little bit of trickery. Suppose you have defaulted to use the stack wide 2 hour default for keepalive. A channel connection between HOST A and HOST B has seen no traffic for 2 hours now, because HOST B crashed that long ago. HOST A's stack, realizing this, will now send its first keepalive probe. This keepalive probe will contain a sequence number which is ONE less than the sequence number most recently ACKnowledged by HOST B. This is done so that the stack on HOST B (if it's still around) is basically forced to respond with an ACKnowledgement to the keepalive while providing information about the NEXT expected sequence number. HOST A will send 10 of these probes, expecting a response, and spacing them approximately 75 seconds apart. That means this check will continue for about 750 seconds or 12 1/2 minutes. The connection is considered dead after that and any outstanding RECEIVES within WebSphere MQ are posted back with a timeout.
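The probe trick and the timing both reduce to a few lines of arithmetic. This is a sketch with names I invented; the defaults are simply the numbers from the paragraph above, and the wraparound handling reflects the fact that TCP sequence numbers are 32 bits wide.

```python
SEQ_MODULUS = 2 ** 32  # TCP sequence numbers wrap at 32 bits

def keepalive_probe_seq(last_ack_from_peer):
    """Sequence number for a probe: one less than what the peer last ACKed."""
    return (last_ack_from_peer - 1) % SEQ_MODULUS

def keepalive_timeline(idle_seconds=7200, probes=10, probe_spacing=75):
    """When probing starts, how long it lasts, and when the socket is given up."""
    probing_window = probes * probe_spacing
    return {
        "first_probe_at": idle_seconds,
        "probing_window": probing_window,
        "declared_dead_at": idle_seconds + probing_window,
    }

print(keepalive_probe_seq(5001))
print(keepalive_timeline())
```

Put together, a dead peer on an idle connection is not noticed until roughly 2 hours and 12 1/2 minutes after the last traffic, which is why the stack default is rarely what you want for a channel.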
By the by, my focus in this blog entry is on z/OS WebSphere MQ, so if you need the function that KAINT provides on a non-Z platform, then HBINT should be used in its place.
So, where does DISCINT (Disconnect Interval) fit in? Well, if a batch of messages is sent and ends, and no new message arrives on the transmit queue which feeds this channel within DISCINT seconds, then the channel becomes inactive. This parameter has special usage for client connections but otherwise its use is fairly standard. DISCINT should be longer than KAINT which should be longer than HBINT.
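A quick sanity check for that ordering might look like the sketch below. The function name is mine, all three values are taken to be in seconds, and special values (such as a timer coded as 0) are not modeled here.

```python
def check_timer_ordering(hbint, kaint, discint):
    """Return a list of complaints if DISCINT > KAINT > HBINT does not hold."""
    problems = []
    if not kaint > hbint:
        problems.append(f"KAINT ({kaint}) should be longer than HBINT ({hbint})")
    if not discint > kaint:
        problems.append(f"DISCINT ({discint}) should be longer than KAINT ({kaint})")
    return problems

print(check_timer_ordering(hbint=300, kaint=360, discint=6000))
print(check_timer_ordering(hbint=300, kaint=200, discint=100))
```

An empty list means the three timers are tiered the way the paragraph above recommends.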
WebSphere MQ can't prevent a network outage, but parms can be set to aid recoverability and recognition of an outage. Some of the parms above accomplish this. Then there's ADOPTMCA (a z/OS QMGR attribute) and the corresponding distributed ADOPTNEWMCA. These keywords allow a new instance of an existing channel to be built if the other side requests it. The old connection is torn down in favor of the new one. This can be useful in cases where a channel has failed but one or both sides is unaware of it. Attempts to start a channel that won't successfully start can be an indication that these attributes might need to be put into place. Attempting to start a channel on z/OS which fails with CSQX531E indicates objects for that channel to start are already in use; so the channel is already perceived to be running properly even though it's not. DISPLAY QSTATUS will indicate if the transmission queue is already otherwise occupied.
So how long should the receiving end of a message channel wait for its data to arrive from the remote sender? If it waits forever then the connection becomes blocked in a receive wait indefinitely. For these cases ReceiveTimeout improves network availability. Queue manager attributes RCVTIME, RCVTTYPE, and RCVTMIN will put such a timer in place. On z/OS, the receive timeout is derived from RCVTIME in one of three ways (depending on how RCVTTYPE is defined). If the resulting number of seconds elapses then the connection times out and will drop with a CSQX259E message. Receipt of this message is typically an indication that either the remote end is gone or some hop along the network path has failed to pass the expected data to its recipient. There are some caveats to the receive timeout depending on platform and connection type, and all of these spicy details are covered at length in the WebSphere MQ Information Center.
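For what it's worth, here is my understanding of the three RCVTTYPE calculations as a Python sketch. Treat the MULTIPLY, ADD and EQUAL semantics, and the way RCVTMIN acts as a floor on the HBINT-relative results, as my reading of the documentation rather than gospel, and check the Information Center for your release. The function name and sample values are invented.

```python
def receive_timeout_seconds(rcvtime, rcvttype, negotiated_hbint, rcvtmin=0):
    """Assumed derivation of the receive timeout from the QMGR attributes."""
    if rcvttype == "EQUAL":
        return rcvtime                          # RCVTIME is the timeout itself
    if rcvttype == "MULTIPLY":
        relative = rcvtime * negotiated_hbint   # RCVTIME is a multiplier of HBINT
    elif rcvttype == "ADD":
        relative = rcvtime + negotiated_hbint   # RCVTIME is added to HBINT
    else:
        raise ValueError(f"unknown RCVTTYPE: {rcvttype}")
    # Assumption: RCVTMIN is a floor for the HBINT-relative results.
    return max(relative, rcvtmin)

print(receive_timeout_seconds(2, "MULTIPLY", 300))
print(receive_timeout_seconds(60, "ADD", 300))
print(receive_timeout_seconds(120, "EQUAL", 300))
```

The point of the floor is that a very small negotiated heartbeat should not produce a receive timeout so short that a healthy but slow network trips it.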
Have you ever seen channel slots depleted with CSQX014E? Most often I've seen this error caused by DNS failures, or by hops within the DNS infrastructure which redirect DNS queries to hosts which are unresponsive. WebSphere MQ makes DNS calls for channels that are both starting and stopping. If DNS fails to respond, these requests can back up, causing MQ's channel slots to be exceeded. Because the nameserver request process within MQ is single-threaded, the channel slots can be exhausted quickly. When CSQX014E is first seen, it's a smart move to check whether all parts of the DNS infrastructure are functioning as expected. The DNS should also be checked to confirm it supports reverse lookups. Rick Armstrong has a blog entry on DNS that's worth a look and can give some valuable insight into why MQ channels can be impacted by the DNS: WebSphere MQ Channels waiting on Domain Name Server
If the right steps are taken early and the right network specialists involved, problems in the network that lead to MQ channel fallouts can be fixed quickly and efficiently, preventing any long-term outage.