The most interesting conversations I've had about WebSphere MQ haven't been solely about MQ at all, but rather about how MQ interacts with so many other products, and there are gobs of them. The level those other products are at can make a big difference in how MQ behaves. Take the Domain Name System, or DNS, for instance. Beginning in z/OS 1.11, the Resolver DNS Cache allowed the option of system-wide caching of DNS responses. The benefit? A great performance improvement, since lookup requests no longer resulted in repetitive flows out to the DNS servers. Caching was a default visible to the entire z/OS image. How the admin has configured the DNS determines how long this cached information is considered reliable, and that information is returned with the query. In fact, the configuration can be set to allow cached information to be considered valid forever.
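The caching idea can be pictured as a simple map with expiry times. The following is a minimal Python sketch of my own (the class and method names are invented for illustration, not the z/OS Resolver's API), showing a response served locally until its time-to-live expires, and a configuration where cached data never expires at all.

```python
import time

class DnsCache:
    """Toy model of system-wide resolver caching (illustration only)."""

    def __init__(self):
        self._entries = {}  # name -> (address, expiry timestamp or None)

    def store(self, name, address, ttl=None):
        # ttl=None models a configuration where cached data is valid forever
        expiry = None if ttl is None else time.time() + ttl
        self._entries[name] = (address, expiry)

    def lookup(self, name):
        # Return the cached address if still valid, else None (the caller
        # would then flow a real query out to the DNS servers).
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expiry = entry
        if expiry is not None and time.time() > expiry:
            del self._entries[name]
            return None
        return address

cache = DnsCache()
cache.store("mqhost.example.com", "10.1.2.3", ttl=300)
print(cache.lookup("mqhost.example.com"))  # served from cache, no network flow
```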
For MQ, the whole mechanism behind getting a name (gethostbyaddr) or getting an IP address (gethostbyname) is single-threaded, handled by the MQ nameserver task, which stacks up the requests one by one. As a resolver reply comes back, one request is peeled off the internal MQ nameserver queue and then the next one is serviced. It sounds simple enough. The DNS configuration is made known to MQ through the TCPIP.DATA data set. If its contents are set up improperly, resolution processing will likely be impacted. Proper setup is crucial for out-of-the-box MQ because calls will be made not only at channel start-up time but also when channels are being shut down. As well, the calls can be either forward (hostname to IP address) or reverse (IP address to hostname) lookups. If reverse lookup processing is not functioning, issuing an NSLOOKUP against a valid IP address should time out. Use of NSLOOKUP has its limitations: it will provide information about the server queried for the information returned, but it does not query local host tables, so if the function of lookups is dependent on such tables this won't be reflected. To more aptly mimic how applications like MQ use DNS, the 'host' command can be executed from within the UNIX shell with either the IP address or hostname as the argument.
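The single-threaded nature of that nameserver task can be pictured as a plain FIFO: every request, no matter which channel issued it, waits behind whatever was queued before it. A rough Python sketch of my own (the function and names are invented, not MQ internals):

```python
from collections import deque

def service_nameserver_queue(requests, resolve):
    """Service queued lookup requests strictly one at a time (FIFO),
    mimicking the single-threaded MQ nameserver task."""
    queue = deque(requests)
    replies = []
    while queue:
        req = queue.popleft()         # peel one request off the queue
        replies.append(resolve(req))  # next request waits until this returns
    return replies

# One slow resolver reply delays every request queued behind it.
order = []
def fake_resolve(name):
    order.append(name)
    return f"addr-of-{name}"

service_nameserver_queue(["chl1.host", "chl2.host", "chl3.host"], fake_resolve)
print(order)  # requests are serviced strictly in arrival order
```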
So what can happen with WebSphere MQ if something in the name resolution configuration is out of kilter? Any one of a number of things. In TCPIP.DATA, the RESOLVERTIMEOUT statement defaults to 30 seconds, which is how long the resolver will wait for a response while trying to communicate with the name server when using UDP packets to solicit information from that server. Note that as of z/OS 1.12 the default for RESOLVERTIMEOUT dropped to just 5 seconds; it was realized that if a nameserver was not replying within just a few seconds, then it was unlikely to ever reply back. Plus, the bigger the RESOLVERTIMEOUT, the longer the resolver delays cycling through the known name servers listed in NSINTERADDR (possibly delaying getting to a server that knows the answer). In addition to this configurable parm, for a forward lookup the value of DOMAINORIGIN determines whether a domain name should be appended to the hostname used in the current query. If so, this domain origin is appended to the hostname on one of the search attempts but not included on the other. It should be noted that, prior to z/OS 1.10, if the DNS configuration was updated to change from DOMAINORIGIN to SEARCH, long-running applications (such as WebSphere MQ) would normally be expected to pick up the change after a MODIFY resolverprocname,REFRESH was issued to refresh the TCPIP data; however, this was found not to be the case. At those levels, a restart of the application is required for such a change.
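The DOMAINORIGIN behavior on a forward lookup amounts to building a list of candidate query names: one attempt with the domain appended and one without. A small Python sketch of my own illustrating that idea (not resolver code; I also model the common DNS convention that a name ending in a dot is treated as fully qualified):

```python
def candidate_names(hostname, domain_origin=None):
    """Build the query names a forward lookup may try (illustration only).

    A name ending in '.' is treated as fully qualified and used as-is;
    otherwise one attempt appends the configured domain origin and one
    attempt uses the bare hostname.
    """
    if hostname.endswith("."):
        return [hostname.rstrip(".")]
    names = []
    if domain_origin:
        names.append(f"{hostname}.{domain_origin}")  # attempt with domain appended
    names.append(hostname)                           # attempt without
    return names

print(candidate_names("mqhost", domain_origin="example.com"))
# -> ['mqhost.example.com', 'mqhost']
```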
The effect of configuring the DNS may not affect all calls similarly. If RESOLVERUDPRETRIES is set (or defaults) to 1, then only a single UDP datagram attempt is made as the resolver tries to contact each of the name servers configured to be contacted. For gethostbyaddr() only a single attempt is made to resolve the lookup; gethostbyname(), on the other hand, can result in two searches (one against the hostname including the domain name and one without). If calls seem to be taking an unusual amount of time to finish, a review of the nameserver boot file and other nameserver files will help show the nameserver configuration. If forwarders are being used, they may not be responding back to the nameserver, leading to delays. Actions to remediate or mitigate such delays can include (a) changing the nameserver order, (b) updating the nameserver boot file to remove any references to the suspected forwarders, or (c) just going to the root name servers in order to get responses to the requested call information. There are cases where nameservers reference forwarders which are no longer on the network, and so call resolution suffers. It's easy to see why delays in call replies can snowball. For reverse lookups, if just a 16-second resolver timeout is set and 100 channels are attempting to start, receiving no replies would amount to a wait of roughly 26.7 minutes for that last channel (100 * 16 = 1600 seconds). In light of this, and as a stop-gap measure, code was added in MQ V6 to prevent the issuance of non-essential calls if necessary. Activation of this code requires Level 2 support involvement, since a service parm must be provided. If caching is being used, it's unlikely that this stop-gap measure would have any effect, since the code strictly involves new lookups destined for the nameserver.
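The snowball arithmetic in the reverse-lookup example is easy to verify: with requests serviced serially, the last of 100 channels queued behind a 16-second timeout waits the full accumulated time.

```python
def worst_case_wait(channels, timeout_seconds):
    """Total serialized wait for the last channel when every reverse
    lookup times out and requests are serviced one at a time."""
    return channels * timeout_seconds

total = worst_case_wait(100, 16)
print(total, "seconds is about", round(total / 60, 1), "minutes")
# -> 1600 seconds is about 26.7 minutes
```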
Because MQ uses a single nameserver task, even calls that don't actually have to use network flows to the DNS can suffer, since these too are queued onto the same request queue that may already be bottlenecked. These other types of calls include conversion calls like inet_pton() and inet_ntop(), which simply change the format of network addresses into a structure understood by a given address family. While these calls never need to leave MQ, if they're held up behind other calls then channels totally disassociated from each other can get held up. MQ makes use of a finite number of channel slots (configured by the user and bound only by the amount of storage available to hold them). These channel slots are used not only by channels which are running and active, but also by those which are attempting to become so. If a call to the DNS is pending, a slot is held until the call completes, whether or not a reply is ever received. If an eventual channel start is unsuccessful, the slot will be released, but in the meantime a pending DNS call will prevent other channels from starting. If the number of slots becomes depleted, the infamous CSQX014E (Listener exceeded channel limit) message will be returned. Why all of the slots are in use is not always immediately apparent, and can have causes other than just nameserver issues. Output from the DIS CHSTATUS(*) ALL command is useful if a check of the channel STATUS and SUBSTATE is made. A substate of NAMESERVER across several channels is always a good indication that your network team should have a look and perhaps plan to collect diagnostics ranging from dumps to resolver CTRACEs.
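The conversion calls mentioned above never touch the network; Python's socket module exposes the same pair of functions, which makes it easy to see that they are pure format transformations between presentation strings and packed binary addresses.

```python
import socket

# inet_pton: presentation (text) form -> packed network (binary) form
packed = socket.inet_pton(socket.AF_INET, "192.0.2.10")
print(packed)  # four raw bytes; no DNS flow is involved

# inet_ntop: packed binary form -> presentation form
text = socket.inet_ntop(socket.AF_INET, packed)
print(text)    # '192.0.2.10'

# The same pair handles the IPv6 address family
packed6 = socket.inet_pton(socket.AF_INET6, "2001:db8::1")
print(socket.inet_ntop(socket.AF_INET6, packed6))  # '2001:db8::1'
```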
If documentation does need to be collected and the suspicion is strong that the nameservers are at fault, then it's important to know how best to collect any resolver traces. Refer to the Informational APARs II13399 and/or II13398 in order to determine whether RESOLVER trace output or a SYSTCPRE CTRACE is the best option. RESOLVER trace output can be reviewed with only a minimal degree of training. If the output is unfamiliar, the Search-ForE function within IPCS can be useful to get a count of, for instance, how many timeouts were traced against the DNS. In the Search-ForE panels the trace data set should be specified. Then the Asis lines should be set up so that the first line contains the type of call in question (e.g. GetHostByAddr) and, if you're interested in timeouts, the next Asis line could be set with 'UDP Timer'. Set up like this, the tool would return a quick count of how many timeouts occurred, giving an indication that the problem has been confirmed as an issue with nameserver processing. Many such traces make use of 'entry' and 'exit' points, so setting the Asis fields to catch the time between entry and exit makes it easy to quantify lags in resolution.
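Outside of IPCS, the same kind of quick count can be made against any trace that has been exported as plain text. A hypothetical sketch follows; the trace line format and the counter itself are invented for illustration and will not match real SYSTCPRE or RESOLVER records.

```python
def count_timeouts(trace_lines, call_name="GetHostByAddr", marker="UDP Timer"):
    """Count traced timeouts for a given call type: a timer-popped line
    seen between that call's entry and exit records (illustration only)."""
    count = 0
    in_call = False
    for line in trace_lines:
        if call_name in line and "Entry" in line:
            in_call = True
        elif call_name in line and "Exit" in line:
            in_call = False
        elif in_call and marker in line:
            count += 1
    return count

# Invented sample lines, purely to exercise the counter
sample = [
    "GetHostByAddr Entry ip=10.1.2.3",
    "UDP Timer popped waiting on nameserver",
    "GetHostByAddr Exit rc=timeout",
    "GetHostByAddr Entry ip=10.1.2.4",
    "GetHostByAddr Exit rc=0",
]
print(count_timeouts(sample))  # -> 1
```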
The relationship between WebSphere MQ and DNS processing is an important one. The calls MQ makes enable the product not only to provide channel name information when a connection fails, but also to give Level 2 important socket information that can be more easily correlated with dump and trace documentation. When the nameserver is working properly, you should never see ??? beside a channel name in the WebSphere MQ CHIN joblog.