IT34867: Managed File Transfer agents hang when a managed transfer goes into recovery and subsequently times out.

APAR status

Closed as program error.

Error description

When using the "transfer recovery timeout" functionality, an MQ
Managed File Transfer agent that is acting as the source agent
for managed transfers writes the following messages to its event
log when a managed transfer enters recovery:

BFGTR0068I: Transfer ID: <transfer_identifier>, to destination
agent <destination_agent>, is entering recovery, immediately,
due to recoverable error: BFGTR0064I: The transfer sender timed
out waiting for a
response from the transfer receiver
BFGTR0081I: Transfer ID: <transfer_identifier> is now in a
recovering state. The agent will wait for <number> seconds for
the transfer to recover.

After the timeout period specified in the BFGTR0081I message,
the managed transfer has not restarted and has not been
cancelled. Shortly afterwards, the agent stops responding to
commands, and no longer processes any new managed transfer
requests.

Local fix

Problem summary

****************************************************************
USERS AFFECTED:
This affects users of IBM MQ 9.1 Managed File Transfer who have
enabled the "transfer recovery timeout" functionality.


Platforms affected:
MultiPlatform

****************************************************************
PROBLEM DESCRIPTION:
The MQ Managed File Transfer "transfer recovery timeout"
functionality allows users to specify the behaviour of a source
agent if a managed transfer goes into recovery. The
functionality provides three possible options:

1. The agent will continue to try and recover the managed
transfer indefinitely. This the default behaviour if the
"transfer timeout recovery" functionality has not been enabled.
2. The agent will fail the managed transfer as soon as it goes
into recovery.
3. The agent will keep trying to recover the managed transfer
for the specified amount of time. If the managed transfer is
still in recovery when the timeout period elapses, the agent
will mark it as "Failed".

More information on this can be found in the "Setting a timeout
for recovery of stalled transfers" topic in the MQ sections of
IBM Knowledge Center. The URI for this topic in the MQ 9.1.x
section of IBM Knowledge Center is
https://www.ibm.com/support/knowledgecenter/en/SSFKSJ_9.1.0/com.
ibm.mq.adm.doc/recovery_timeout.html.

Now, if the "transfer recovery timeout" functionality has been
enabled to try and recover a managed transfer for a specified
period of time, then when a managed transfer enters recovery the
source agent for that managed transfer will write the message:

BFGTR0081I: Transfer ID: '<transfer_identifier>' is now in
recovering state. The agent will wait for <timeout> seconds for
the transfer to recover.

to its event log (output0.log) and start a new internal thread,
called "TransferRecovery[<transfer_identifier>]". If the managed
transfer resumes during the specified timeout period, the source
agent writes the message:

BFGTR0082I: Transfer ID: '<transfer_identifier>' that was in a
recovering state, has now resumed.

to its event log and stops the internal
"TransferRecovery[<transfer_identifier>]". However, if the
managed transfer has not restarted by the time the timeout
period elapses, then the
"TransferRecovery[<transfer_identifier>]" thread will stop the
managed transfer and mark it as "Failed".

Now, if:

- A managed transfer went into recovery, and the "transfer
recovery timeout" functionality was enabled to try and recover
the managed transfer for a specified period of time.
- And the managed transfer came out of recovery at the same time
as the transfer recovery timeout elapsed.

then it was possible for the following sequence of events to
occur:

- An internal Transfer Sender thread (called
"TransferSender[<transfer_identifier>]") for the managed
transfer started up and began the process of restarting the
managed transfer.
- The internal "TransferRecovery[<transfer_identifier>]" thread
locked the entry for the managed transfer in the agent's state
store, updated the entry to show that the managed transfer had
now entered a "FailedTransferEnding" state and then waited for
the "TransferSender[<transfer_identifier>]" thread to stop.
- Shortly afterwards, the
"TransferSender[<transfer_identifier>]" thread tried to lock the
entry for the managed transfer in the agent's state store, and
became blocked as the entry was locked by the
"TransferRecovery<transfer_identifier>" thread.

This resulted in the managed transfer getting stuck and never
completing.

A Javacore collected from the agent after the issue had occurred
would show entries similar to the ones shown below:

------------------------------------------------------------
3XMTHREADINFO
"TransferRecovery[414d51204d46545037202020202020200cfc985f5ca83d
2b]" J9VMThread:0x00000000022FC600,
omrthread_t:0x00007FE7584F3940,
java/lang/Thread:0x0000000087A895F8, state:CW, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x754C1,
isDaemon:false)
3XMTHREADINFO1            (native thread ID:0x102B, native
priority:0x5, native policy:UNKNOWN, vmstate:CW, vm thread
flags:0x00000181)
3XMTHREADINFO2            (native stack address range
from:0x00007FE73C529000, to:0x00007FE73C569000, size:0x40000)
3XMCPUTIME               CPU usage total: 0.018578606 secs,
current category="Application"
3XMTHREADBLOCK     Waiting on:
com/ibm/wmqfte/thread/FTEThread@0x0000000087A89578 Owned by:
<unowned>
3XMHEAPALLOC             Heap bytes allocated since last GC
cycle=0 (0x0)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at java/lang/Object.wait(Native
Method)
4XESTACKTRACE                at
java/lang/Object.wait(Object.java:218(Compiled Code))
4XESTACKTRACE                at
java/lang/Thread.join(Thread.java:700(Compiled Code))
5XESTACKTRACE                   (entered lock:
com/ibm/wmqfte/thread/FTEThread@0x0000000087A89578, entry count:
1)
4XESTACKTRACE                at
com/ibm/wmqfte/transfer/impl/TransferSenderImpl.stop(TransferSen
derImpl.java:269(Compiled Code))
4XESTACKTRACE                at
com/ibm/wmqfte/statestore/impl/FTEStateStoreImpl.failTransferonR
ecoveryTimeout(FTEStateStoreImpl.java:6186)
4XESTACKTRACE                at
com/ibm/wmqfte/statestore/impl/FTEStateStoreImpl.handleRecoveryT
imeout(FTEStateStoreImpl.java:6024)
4XESTACKTRACE                at
com/ibm/wmqfte/transfer/impl/TransferRecoveryRunnable.run(Transf
erRecoveryRunnable.java:119(Compiled Code))
5XESTACKTRACE                   (entered lock:
com/ibm/wmqfte/transfer/impl/TransferRecoveryRunnable@0x00000000
87B57800, entry count: 1)
4XESTACKTRACE                at
java/lang/Thread.run(Thread.java:820(Compiled Code))
4XESTACKTRACE                at
com/ibm/wmqfte/thread/FTEThread.run(FTEThread.java:70(Compiled
Code))

------------------------------------------------------------
3XMTHREADINFO
"TransferSender[414d51204d46545037202020202020200cfc985f5ca83d2b
]" J9VMThread:0x0000000002319600,
omrthread_t:0x00007FE76C16DE98,
java/lang/Thread:0x0000000087A89578, state:P, prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x754C0,
isDaemon:false)
3XMTHREADINFO1            (native thread ID:0x102A, native
priority:0x5, native policy:UNKNOWN, vmstate:P, vm thread
flags:0x00020081)
3XMTHREADINFO2            (native stack address range
from:0x00007FE7763BA000, to:0x00007FE7763FA000, size:0x40000)
3XMCPUTIME               CPU usage total: 0.011055033 secs,
current category="Application"
3XMTHREADBLOCK     Parked on:
java/util/concurrent/locks/ReentrantLock$FairSync@0x0000000086C9
0588 Owned by:
"TransferRecovery[414d51204d46545037202020202020200cfc985f5ca83d
2b]" (J9VMThread:0x00000000022FC600,
java/lang/Thread:0x0000000087A895F8)
3XMHEAPALLOC             Heap bytes allocated since last GC
cycle=0 (0x0)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at sun/misc/Unsafe.park(Native
Method)
4XESTACKTRACE                at
java/util/concurrent/locks/LockSupport.park(LockSupport.java:186
(Compiled Code))
4XESTACKTRACE                at
java/util/concurrent/locks/AbstractQueuedSynchronizer.parkAndChe
ckInterrupt(AbstractQueuedSynchronizer.java:847(Compiled Code))
4XESTACKTRACE                at
java/util/concurrent/locks/AbstractQueuedSynchronizer.acquireQue
ued(AbstractQueuedSynchronizer.java:881(Compiled Code))
4XESTACKTRACE                at
java/util/concurrent/locks/AbstractQueuedSynchronizer.acquire(Ab
stractQueuedSynchronizer.java:1210(Compiled Code))
4XESTACKTRACE                at
java/util/concurrent/locks/ReentrantLock$FairSync.lock(Reentrant
Lock.java:235(Compiled Code))
4XESTACKTRACE                at
java/util/concurrent/locks/ReentrantLock.lock(ReentrantLock.java
:296(Compiled Code))
4XESTACKTRACE                at
com/ibm/wmqfte/statestore/impl/FTEMutableCommonState.lock(FTEMut
ableCommonState.java:735(Compiled Code))
4XESTACKTRACE                at
com/ibm/wmqfte/statestore/impl/FTEStateStorePersistence.getMutab
leSenderState(FTEStateStorePersistence.java:1763(Compiled Code))
4XESTACKTRACE                at
com/ibm/wmqfte/statestore/impl/FTEStateStoreImpl.senderTransferR
ecovered(FTEStateStoreImpl.java:2815)
4XESTACKTRACE                at
com/ibm/wmqfte/transfer/impl/TransferSenderRunnable.transferReco
vered(TransferSenderRunnable.java:1568)
4XESTACKTRACE                at
com/ibm/wmqfte/transfer/impl/TransferSenderRunnable.run(Transfer
SenderRunnable.java:568(Compiled Code))
4XESTACKTRACE                at
java/lang/Thread.run(Thread.java:820(Compiled Code))
4XESTACKTRACE                at
com/ibm/wmqfte/thread/FTEThread.run(FTEThread.java:70(Compiled
Code))
------------------------------------------------------------

If, after the issue had occurred, the agent's internal
CommandHandler thread received another message for this managed
transfer (such as a Cancel Transfer request), then this thread
would also become blocked as it would try to access the entry
for the managed transfer in the agent's state store, which was
locked by the TransferRecovery[<transfer_identifier>] thread. In
this situation, the agent would be unable to process any new
managed transfer requests, or complete any in-flight managed
transfers.

Problem conclusion

To resolve this issue, the transfer recovery processing has been
updated so that the internal TransferRecovery thread for a
managed transfer releases any locks it holds on the entry for
that managed transfer in the agent's state store before waiting
for the associated TransferSender thread to stop. This allows
the TransferSender thread to proceed and then stop in a timely
manner, which in turn means that the TransferRecovery thread can
fail the managed transfer as expected.

Note: This issue only affects MQ 9.1 Managed File Transfer. It
does not occur in MQ 9.2, as the issue has been fixed in MQ
9.2.0.0 GA.

---------------------------------------------------------------
The fix is targeted for delivery in the following PTFs:

Version    Maintenance Level
v9.1 LTS   9.1.0.9

The latest available MQ maintenance can be obtained from
'WebSphere MQ Recommended Fixes'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037

If the maintenance level is not yet available information on
its planned availability can be found in 'WebSphere MQ
Planned Maintenance Release Dates'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
---------------------------------------------------------------

Temporary fix

Comments

APAR Information

APAR number
IT34867
Reported component name
IBM MQ MFT V9.1
Reported component ID
5724H7272
Reported release
910
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2020-11-10
Closed date
2021-06-24
Last modified date
2021-11-08

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name
IBM MQ MFT V9.1
Fixed component ID
5724H7272

Applicable component levels

[{"Line of Business":{"code":"LOB45","label":"Automation"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSYHRD","label":"IBM MQ"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"910"}]

Document Information

Modified date:
09 November 2021

Tips

IT34867: Managed File Transfer agents hang when a managed transfer goes into recovery and subsequently times out.

Subscribe to this APAR

APAR status

Closed as program error.

Error description

Local fix

Problem summary

Problem conclusion

Temporary fix

Comments

APAR Information

APAR number

Reported component name

Reported component ID

Reported release

Status

PE

HIPER

Special Attention

Submitted date

Closed date

Last modified date

APAR is sysrouted FROM one or more of the following:

APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name

Fixed component ID

Applicable component levels

Document Information

Share your feedback

Need support?