The customer is using the MQ Resource Adapter with a CCDT on a third-party application server, connecting to queue managers on z/OS. On the z/OS side there is a two-way queue sharing group with QMGR-A and QMGR-B. On the client side there are six application servers on each of four Unix machines, and the application is written in MQ Classes for Java. If each application server holds 20 connections to a QMGR, the number of client channels is 6 * 4 * 20 = 480.
The customer wants each application server to connect to a specific QMGR for business reasons. To do this, they use a CCDT with AFFINITY(PREFERRED) and the CLNTWGHT parameter as follows.
- For AppServer-1,3,5: set CLNTWGHT to 99 for CLNTCONN CHANNEL-X, which connects to QMGR-A,
  and CLNTWGHT to 1 for CLNTCONN CHANNEL-Y, which connects to QMGR-B
- For AppServer-2,4,6: set CLNTWGHT to 99 for CLNTCONN CHANNEL-Y, which connects to QMGR-B,
  and CLNTWGHT to 1 for CLNTCONN CHANNEL-X, which connects to QMGR-A
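As a sketch of what such a CCDT could be generated from, the MQSC definitions for AppServer-1,3,5 might look like the following. The CONNAMEs, the QSG name, and the dotted channel names are illustrative assumptions, not taken from the customer's system (MQ object names cannot contain hyphens, so CHANNEL.X/CHANNEL.Y stand in for CHANNEL-X/CHANNEL-Y):

```mqsc
* Preferred channel: weight 99, routes to QMGR-A
DEFINE CHANNEL(CHANNEL.X) CHLTYPE(CLNTCONN) +
       CONNAME('qmgra.example.com(1414)') QMNAME(QSG1) +
       CLNTWGHT(99) AFFINITY(PREFERRED)
* Fallback channel: weight 1, routes to QMGR-B
DEFINE CHANNEL(CHANNEL.Y) CHLTYPE(CLNTCONN) +
       CONNAME('qmgrb.example.com(1414)') QMNAME(QSG1) +
       CLNTWGHT(1) AFFINITY(PREFERRED)
```

For AppServer-2,4,6 the two CLNTWGHT values would simply be swapped. With AFFINITY(PREFERRED), each process remembers its weighted channel ordering and keeps preferring the same QMGR across connections.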
In this way, AppServer-1,3,5 should normally connect to QMGR-A and AppServer-2,4,6 to QMGR-B, unless some connection problem occurs. That is the customer's expectation. However, the customer found that AppServer-1,3,5 frequently connected to QMGR-B, and AppServer-2,4,6 to QMGR-A.
There must have been some connection problem; we assumed one of the following was occurring.
1. The QMGR rejects the connection for some reason
2. z/OS TCPIP or hardware (OSA) rejects the connection for some reason
3. Some intermediate node rejects the connection
4. TCPIP on the client side connects to the other QMGR for some reason
5. The CCDT does not work and the client connects to the other QMGR
At first, we suspected the backlog for the TCPIP listener on the z/OS QMGR, because the backlog value in the TCPIP profile was 10, which seemed small. (On current z/OS releases the default is 1024, but the customer was still using a profile carried over from an earlier release, whose default was 10.) It is known that many connection requests are issued at the same time in this environment.
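For reference, the listener backlog limit on z/OS is controlled by the SOMAXCONN statement in the TCPIP profile. A sketch, using the default values mentioned above rather than the customer's actual profile:

```
; PROFILE.TCPIP - maximum backlog for listening sockets
; Older releases defaulted to 10; current releases default to 1024
SOMAXCONN 1024
```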
We asked the customer to issue the NETSTAT command after the problem occurred, to check whether the backlog value was adequate, but the value of ConnectionsDropped was zero. ConnectionsDropped is the number of connection requests received by the server and dropped because the backlog queue was already full. Hence, this was not the cause.
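As an assumed example of the check (the exact report layout varies by z/OS release), the TSO Netstat ALL report filtered to the listener port shows the backlog counters for the listening socket, including ConnectionsDropped:

```
NETSTAT ALL (PORT 1414
```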
Having no other leads, we decided to take a network trace to identify the problem. Looking into the trace, we found interesting behavior. The trace shows the following.
Seq Time Src addr port Dest addr port Info
----- ---------- -------- ----- --------- ----- ----------------------------------------------------------------------------
69480 1554656083 x.x.x.x 63051 y.y.y.y 1414 MQ 106 MQDISC C.R=1.0
69481 1554656083 y.y.y.y 1414 x.x.x.x 63051 MQ 106 MQDISC_REPLY C.R=1.0
69486 1554656083 y.y.y.y 1414 x.x.x.x 63051 1414 → 63051 [FIN, PSH, ACK] Seq=4441 Ack=4529 Win=4094 Len=0
69487 1554656083 x.x.x.x 63051 y.y.y.y 1414 63051 → 1414 [ACK] Seq=4529 Ack=4442 Win=32768 Len=0
69690 1554656083 x.x.x.x 63051 y.y.y.y 1414 63051 → 1414 [FIN, ACK] Seq=4529 Ack=4442 Win=0 Len=0
69693 1554656083 y.y.y.y 1414 x.x.x.x 63051 1414 → 63051 [PSH, ACK] Seq=4442 Ack=4530 Win=4094 Len=0
84032 1554656086 x.x.x.x 63051 y.y.y.y 1414 63051 → 1414 [SYN] Seq=0 Win=32768 Len=0 MSS=1460 WS=1 [Port numbers reused]
86673 1554656087 x.x.x.x 63051 y.y.y.y 1414 63051 → 1414 [SYN] Seq=0 Win=32768 Len=0 MSS=1460 WS=1 [TCP Retransmission]
92038 1554656088 x.x.x.x 63051 y.y.y.y 1414 63051 → 1414 [SYN] Seq=0 Win=32768 Len=0 MSS=1460 WS=1 [TCP Retransmission]
103457 1554656090 x.x.x.x 63051 y.y.y.y 1414 63051 → 1414 [SYN] Seq=0 Win=32768 Len=0 MSS=1460 WS=1 [TCP Retransmission]
127438 1554656094 x.x.x.x 63051 y.y.y.y 1414 63051 → 1414 [SYN] Seq=0 Win=32768 Len=0 MSS=1460 WS=1 [TCP Retransmission]
<Port 1414 is the z/OS QMGR; 63051 is the client application>
Clearly, the SYN packet was not accepted by z/OS and was retransmitted repeatedly (Seq 84032 to 127438), finally ending in a retransmission timeout. We then focused on the FIN exchange of the previous connection (Seq 69486 to 69693). The FIN was triggered by the MQDISC requested by the application (Seq 69480 to 69481). Just 3 seconds after the connection was closed, the same client port number was reused for the next SYN.
In this case, the z/OS side issued the FIN first (active close), the client side replied with its own FIN, and z/OS sent the final ACK. At this stage, the connection on the z/OS side should be in the TIME_WAIT state. The last ACK must reach the client side, and to assure this the z/OS side must stay in TIME_WAIT for a specific time (2MSL); while a socket pair is in TIME_WAIT, a new connection on that same pair is not accepted.
The value of TIMEWAITINTERVAL in the TCPIP profile on z/OS is set to 60, so the connection on the z/OS side stays in TIME_WAIT for 120 seconds (60 * 2). In this case, the client side issued a new SYN after only 3 seconds. This is why we see the retransmission timeout in the network trace. Since the connection request was not accepted, MQ selected the other CLNTCONN channel according to the definitions in the CCDT. As the Knowledge Center states, "Each connection in the process attempts to connect using the first definition in the list. If a connection is unsuccessful the next definition is used. Unsuccessful definitions with client channel weight values other than 0 are moved to the end of the list." This is why the client side connected to the other QMGR.
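TIMEWAITINTERVAL lives on the TCPCONFIG statement of the TCPIP profile; the customer's current setting would look like the sketch below. As discussed later, lowering it here would affect every TCP application on the stack, not just MQ:

```
; PROFILE.TCPIP - TIME_WAIT hold time in seconds
TCPCONFIG TIMEWAITINTERVAL 60
```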
The solution may be simply not to disconnect: if the client application does not issue MQDISC, this problem does not occur. This may involve tuning the connection pool for MQ Classes for Java. The problem might also be resolved by reducing TIMEWAITINTERVAL, but that is a system-wide parameter affecting all TCPIP applications, not only MQ. Another avenue is to investigate why the sending port is reused after such a short period. In this case, LOCLADDR is not specified in the CCDT, so the sending port is not defined as far as MQ is concerned, and the MQ code does not issue bind() for a specific port. The sending port should therefore be selected by the Unix OS as an ephemeral port. Reusing the same ephemeral port after only 3 seconds seems to be strange behavior (or a problem?) of TCPIP on the client side.
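To illustrate that last point, the minimal sketch below (plain java.net, not MQ) shows what happens when a client socket connects without an explicit bind(): the OS assigns an ephemeral local port, and two sockets that are open at the same time can never share one. The trace therefore implies the client OS handed the same ephemeral port back only 3 seconds after it was released; the listener here is a throwaway local stand-in for the QMGR's port 1414.

```java
import java.net.ServerSocket;
import java.net.Socket;

public class EphemeralPortDemo {

    // Open two client connections to a throwaway local listener and return
    // the local ports the OS assigned to each client socket.
    public static int[] allocateTwoPorts() throws Exception {
        try (ServerSocket listener = new ServerSocket(0)) {
            int port = listener.getLocalPort();
            // No explicit bind() on the client side, just as MQ behaves
            // when LOCLADDR is not specified in the CCDT: the OS picks
            // an ephemeral source port at connect time.
            try (Socket c1 = new Socket("127.0.0.1", port);
                 Socket c2 = new Socket("127.0.0.1", port)) {
                return new int[] { c1.getLocalPort(), c2.getLocalPort() };
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] p = allocateTwoPorts();
        System.out.println("ephemeral ports assigned: " + p[0] + ", " + p[1]);
    }
}
```

Once a port is released, however, nothing in the sockets API stops the OS from re-issuing it immediately; how long it waits before recycling a freed ephemeral port is an OS implementation choice, which is exactly what needs investigating on the client machines.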