IT29729: An MQ classes for JMS application may hang when using automatic client reconnect after network interruptions

APAR status

Closed as program error.

Error description

An MQ classes for JMS application using the automatic client
reconnect function may hang if there are frequent network
interruptions or packet loss between the client and queue
manager systems.  When this occurs, messages are not delivered
to the application and the depth of the MQ queue increases.  A
Javacore (thread dump) of the application JVM will show that
application and internal MQ classes for JMS threads are stuck
with the following callstacks until the JVM is killed and
restarted:

Java callstack of an application thread attempting to consume a
message:

"Application-Thread-1" prio=5 os_prio=0 tid=0x0000000019051000 nid=0x1e8c in Object.wait()
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait
        at com.ibm.mq.jmqi.remote.api.RemoteHconn.checkForReconnect
        - locked <0x00000000c8fc53b0> (a com.ibm.mq.jmqi.remote.api.RemoteHconn$ReconnectMutex)
        at com.ibm.mq.jmqi.remote.impl.RemoteProxyQueue.requestMutex
        at com.ibm.mq.jmqi.remote.impl.RemoteProxyQueue.requestMessagesReconnectable
        at com.ibm.mq.jmqi.remote.impl.RemoteProxyQueue.requestMessages
        at com.ibm.mq.jmqi.remote.impl.RemoteProxyQueue.flushQueue
        at com.ibm.mq.jmqi.remote.impl.RemoteProxyQueue.proxyMQGET
        at com.ibm.mq.jmqi.remote.api.RemoteFAP.jmqiGetInternalWithRecon
        at com.ibm.mq.jmqi.remote.api.RemoteFAP.jmqiGetInternal
        at com.ibm.mq.jmqi.internal.JmqiTools.getMessage
        at com.ibm.mq.jmqi.remote.api.RemoteFAP.jmqiGet
        at com.ibm.mq.ese.jmqi.InterceptedJmqiImpl.jmqiGet
        at com.ibm.mq.ese.jmqi.ESEJMQI.jmqiGet
        at com.ibm.msg.client.wmq.internal.WMQConsumerShadow.getMsg
        - locked <0x00000000c8fc5430> (a java.lang.Object)
        at com.ibm.msg.client.wmq.internal.WMQSyncConsumerShadow.receiveInternal
        at com.ibm.msg.client.wmq.internal.WMQConsumerShadow.receive
        at com.ibm.msg.client.wmq.internal.WMQMessageConsumer.receive
        at com.ibm.msg.client.jms.internal.JmsMessageConsumerImpl.receiveInboundMessage
        at com.ibm.msg.client.jms.internal.JmsMessageConsumerImpl.receive
        at com.ibm.mq.jms.MQMessageConsumer.receive


The MQ classes for JMS "Remote Receive Thread", which is
responsible for reading data sent by the queue manager over a
TCP/IP connection:

"RcvThread: com.ibm.mq.jmqi.remote.impl.RemoteTCPConnection@1303583763[qmid=QM1_2019-07-10_15.32.22,fap=13,channel=JMS.SVRCONN,ccsid=819,sharecnv=10,hbint=300,peer=localhost/127.0.0.1(1414),localport=50251,ssl=no]" #334 daemon prio=5 os_prio=0 tid=0x0000000015b93000 nid=0x1ef0 in Object.wait()
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait
        at com.ibm.mq.jmqi.remote.api.RemoteHconn.checkForReconnect
        - locked <0x00000000c8f6e618> (a com.ibm.mq.jmqi.remote.api.RemoteHconn$ReconnectMutex)
        at com.ibm.mq.jmqi.remote.api.RemoteHconn.getSession
        at com.ibm.mq.jmqi.remote.api.RemoteHconn.getSession
        at com.ibm.mq.jmqi.remote.api.RemoteFAP.spiOpen
        at com.ibm.mq.jmqi.remote.api.RemoteFAP.spiOpen
        at com.ibm.mq.jmqi.remote.api.RemoteHconn.dummyJmqiCall
        at com.ibm.mq.jmqi.remote.api.RemoteHconn.eligibleForReconnect
        at com.ibm.mq.jmqi.remote.api.RemoteHconn.deliverException
        at com.ibm.mq.jmqi.remote.impl.RemoteSession.deliverException
        at com.ibm.mq.jmqi.remote.impl.RemoteConnection.asyncConnectionBroken
        - locked <0x00000000c8f293e8> (a com.ibm.mq.jmqi.remote.impl.RemoteConnection$SessionsMutex)
        at com.ibm.mq.jmqi.remote.impl.RemoteRcvThread.run
        at com.ibm.msg.client.commonservices.workqueue.WorkQueueItem.runTask
        at com.ibm.msg.client.commonservices.workqueue.SimpleWorkQueueItem.runItem
        at com.ibm.msg.client.commonservices.workqueue.WorkQueueItem.run
        at com.ibm.msg.client.commonservices.workqueue.WorkQueueManager.runWorkQueueItem
        at com.ibm.msg.client.commonservices.j2se.workqueue.WorkQueueManagerImplementation$ThreadPoolWorker.run


The MQ classes for JMS "Remote Reconnect Thread", which is
responsible for creating connection and object handles for JMS
resources used by the application:

"JMSCCThreadPoolWorker-2" #92 daemon prio=5 os_prio=0 tid=0x00000000190e7000 nid=0x1e18 waiting for monitor entry [0x000000002be7e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.ibm.mq.jmqi.remote.impl.RemoteConnection.removeSession
        - waiting to lock <0x00000000c8f293e8> (a com.ibm.mq.jmqi.remote.impl.RemoteConnection$SessionsMutex)
        at com.ibm.mq.jmqi.remote.impl.RemoteSession.disconnect
        at com.ibm.mq.jmqi.remote.api.RemoteFAP.jmqiConnect
        at com.ibm.mq.jmqi.remote.impl.RemoteReconnectThread.reconnect
        at com.ibm.mq.jmqi.remote.impl.RemoteReconnectThread.run
        at com.ibm.msg.client.commonservices.workqueue.WorkQueueItem.runTask
        at com.ibm.msg.client.commonservices.workqueue.SimpleWorkQueueItem.runItem
        at com.ibm.msg.client.commonservices.workqueue.WorkQueueItem.run
        at com.ibm.msg.client.commonservices.workqueue.WorkQueueManager.runWorkQueueItem
        at com.ibm.msg.client.commonservices.j2se.workqueue.WorkQueueManagerImplementation$ThreadPoolWorker.run
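
As a rough diagnostic aid, an application could check from inside its own
JVM for threads that are currently inside the
com.ibm.mq.jmqi.remote.api.RemoteHconn.checkForReconnect method seen in
the callstacks above. The sketch below uses only standard Java APIs. The
class name ReconnectHangScan and its helper methods are hypothetical and
are not part of the MQ classes for JMS. A thread can legitimately appear
in this method for a short time during a normal reconnect, so a single
hit is not proof of this problem. Only threads that stay there across
repeated samples, together with a growing queue depth, would suggest it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ReconnectHangScan {
    // Class and method names copied from the callstacks in this APAR.
    static final String HCONN_CLASS = "com.ibm.mq.jmqi.remote.api.RemoteHconn";
    static final String WAIT_METHOD = "checkForReconnect";

    /** Returns true if any frame of the stack is RemoteHconn.checkForReconnect. */
    static boolean isWaitingForReconnect(StackTraceElement[] stack) {
        for (StackTraceElement frame : stack) {
            if (HCONN_CLASS.equals(frame.getClassName())
                    && WAIT_METHOD.equals(frame.getMethodName())) {
                return true;
            }
        }
        return false;
    }

    /** Returns the names of live threads that are currently inside checkForReconnect. */
    static List<String> threadsWaitingForReconnect() {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            if (isWaitingForReconnect(entry.getValue())) {
                names.add(entry.getKey().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Sample once; a monitoring task would sample repeatedly and compare results.
        System.out.println("Threads in checkForReconnect: " + threadsWaitingForReconnect());
    }
}
```

A Javacore or thread dump remains the more complete source of evidence
because it also shows which monitors each thread holds.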

Local fix

Problem summary

****************************************************************
USERS AFFECTED:
This issue affects users of the IBM MQ classes for JMS automatic
client reconnect function.


Platforms affected:
MultiPlatform

****************************************************************
PROBLEM DESCRIPTION:
A JMS Connection and a JMS Session created by an MQ classes for
JMS application each have a connection to a queue manager.
These connections are referred to as "conversations" or
"connection handles" (hConns) and multiple hConns can be
multiplexed over a server-connection channel instance - as
determined by the value of the channel's sharing conversations
property.

A deadlock occurred within the MQ classes for JMS automatic
client reconnect function when a connection error was detected
by the hConn associated with a JMS Session and reconnect
processing was invoked.  This prevented the reconnection
processing from completing.

The "RcvThread" that was associated with the channel instance
(TCP/IP connection) used by the JMS Session's hConn attempted to
verify that its "parent" hConn (the one associated with the JMS
Connection) was either still valid, already reconnected or in
need of reconnection.  This is because the JMS Session must
always connect to the same queue manager as the JMS Connection
from which it was created and uses connection information from
this parent hConn.

In MQ V9.1, it did this by attempting to issue a lightweight MQ
API call to the queue manager because, for the most part, the
hConn associated with a JMS Connection is used as a controlling
hConn for asynchronous consume operations and so few MQ API
calls are issued using it.  Before issuing the MQ API call, the
"RcvThread" took a lock on a list of hConns multiplexed over a
channel instance and checked to see if the hConn was in the
process of being reconnected.  It was and so the "RcvThread"
blocked, waiting for the reconnect to complete.

An internal "RemoteReconnectThread" is responsible for
reconnecting hConns.  It was in the process of attempting to
reconnect a particular hConn and required the lock on the list
of hConns for the channel instance in order to perform some
clean-up.  This was because it initially tried to reconnect the
hConn using an existing channel instance but failed because that
channel instance was in the process of being disconnected due to the
original connection error.

The "RemoteReconnectThread" could not obtain the lock which was
held by the "RcvThread", which would not release it until the
reconnect processing was completed by the
"RemoteReconnectThread".

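The lock interaction described above can be modelled with a small,
self-contained Java sketch. This is a toy model only, not the MQ classes
for JMS implementation: the class ReconnectDeadlockSketch, the lock
hconnListLock and the latch reconnectDone are hypothetical stand-ins for
the lock on the list of hConns and for the "reconnect complete" signal.
The real threads wait without a timeout, which is why the JVM must be
restarted. The sketch uses timeouts so that it always terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

public class ReconnectDeadlockSketch {

    /**
     * Runs the toy scenario and reports whether the reconnect thread finished.
     * holdLockWhileWaiting=true models the behaviour described in this APAR.
     */
    static boolean reconnectCompletes(boolean holdLockWhileWaiting, long timeoutMillis) {
        ReentrantLock hconnListLock = new ReentrantLock();      // lock on the list of hConns
        CountDownLatch reconnectDone = new CountDownLatch(1);   // "reconnect complete" signal
        CountDownLatch rcvWaiting = new CountDownLatch(1);      // orders the two threads
        AtomicBoolean completed = new AtomicBoolean(false);

        Thread rcv = new Thread(() -> {
            hconnListLock.lock();
            boolean held = true;
            try {
                if (!holdLockWhileWaiting) {
                    hconnListLock.unlock();
                    held = false;
                }
                rcvWaiting.countDown();
                // Wait for the reconnect to finish (bounded here, unbounded in the real hang).
                reconnectDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                if (held) {
                    hconnListLock.unlock();
                }
            }
        }, "RcvThread-sketch");

        Thread reconnect = new Thread(() -> {
            try {
                rcvWaiting.await();
                // Clean-up needs the same lock before the reconnect can complete.
                if (hconnListLock.tryLock(timeoutMillis / 2, TimeUnit.MILLISECONDS)) {
                    try {
                        completed.set(true);
                    } finally {
                        hconnListLock.unlock();
                    }
                    reconnectDone.countDown();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "RemoteReconnectThread-sketch");

        rcv.start();
        reconnect.start();
        try {
            rcv.join();
            reconnect.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the sketch", e);
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("lock held while waiting: reconnect completes = "
                + reconnectCompletes(true, 400));
        System.out.println("lock released first:     reconnect completes = "
                + reconnectCompletes(false, 400));
    }
}
```

With the lock held while waiting, the reconnect thread in this model
should fail to acquire the lock and never signal completion. The second
mode, where the lock is released first, is included only as a contrast
and is not the change that was made in the product.
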
Problem conclusion

Two changes have been made to the MQ classes for JMS.

The first is to ensure that the "RemoteReconnectThread" does not
attempt to reuse old channel instances to reconnect broken
connection handles (hConns).  At the start of each reconnect
cycle, a new connection is created which can then be used to
reconnect disconnected hConns.

The second change is to remove the need for the "RcvThread" to
make an MQ API call on a "parent" hConn when a connection error
is detected on a child hConn.  The channel heartbeating function
is sufficient to detect errors on the channel instance
associated with a parent hConn even if it is not being used to
issue regular MQ API calls.

---------------------------------------------------------------
The fix is targeted for delivery in the following PTFs:

Version    Maintenance Level
v9.1 CD    9.1.4
v9.1 LTS   9.1.0.4

The latest available maintenance can be obtained from
'WebSphere MQ Recommended Fixes'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037

If the maintenance level is not yet available, information on
its planned availability can be found in 'WebSphere MQ
Planned Maintenance Release Dates'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
---------------------------------------------------------------
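
The maintenance levels listed above can be turned into a simple check of
an installed 9.1 client level. The sketch below is illustrative only.
The class FixLevelCheck is hypothetical, and it assumes that the level
is written in V.R.M.F form, with a modification level of 0 for LTS and a
non-zero modification level for CD. It deliberately rejects levels
outside 9.1 because this APAR only names 9.1 PTFs.

```java
public class FixLevelCheck {

    /**
     * Returns true if a 9.1 level is at or above the PTFs named in this APAR:
     * 9.1.0.4 for LTS (assumed to be 9.1.0.F) or 9.1.4 for CD (assumed to be 9.1.M with M > 0).
     */
    static boolean includesFix(String vrmf) {
        String[] parts = vrmf.trim().split("\\.");
        if (parts.length < 3 || parts.length > 4) {
            throw new IllegalArgumentException("Expected V.R.M or V.R.M.F: " + vrmf);
        }
        int[] n = new int[4]; // a missing fix-pack number defaults to 0
        for (int i = 0; i < parts.length; i++) {
            n[i] = Integer.parseInt(parts[i]);
        }
        if (n[0] != 9 || n[1] != 1) {
            throw new IllegalArgumentException("Only 9.1 levels are covered here: " + vrmf);
        }
        if (n[2] == 0) {
            return n[3] >= 4; // LTS: 9.1.0.4 or later
        }
        return n[2] >= 4;     // CD: 9.1.4 or later
    }

    public static void main(String[] args) {
        for (String level : new String[] {"9.1.0.3", "9.1.0.4", "9.1.3", "9.1.4"}) {
            System.out.println(level + " includes fix: " + includesFix(level));
        }
    }
}
```

The level of the MQ classes for JMS in use has to be obtained
separately, and the client level matters here rather than the queue
manager level, because the changes were made to the MQ classes for JMS.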

Temporary fix

Comments

APAR Information

APAR number
IT29729
Reported component name
IBM MQ BASE MP
Reported component ID
5724H7271
Reported release
910
Status
CLOSED PER
PE
NoPE
HIPER
YesHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2019-07-15
Closed date
2019-10-04
Last modified date
2019-10-04

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name
IBM MQ BASE MP
Fixed component ID
5724H7271

Applicable component levels


Document Information

Modified date:
04 October 2019
