APAR status
Closed as program error.
Error description
An IBM MQ V9.0.0.3 Managed File Transfer agent is connecting to its queue manager using the CLIENT transport, and has a number of resource monitors associated with it that poll directories looking for files that have names which match a specific pattern. When the monitors find a file that matches the pattern, they submit managed transfer requests to transfer that file. After running for a period of time, the agent loses connectivity to its agent queue manager and goes into recovery. When the agent has reconnected to the agent queue manager, the resource monitors associated with the agent fail to submit any managed transfer requests to the agent, even though files that match the specified pattern are put in the directories that the resource monitors are polling.
Local fix
The agent needs to be stopped and restarted after it has lost connectivity to its agent queue manager.
Problem summary
****************************************************************
USERS AFFECTED:
This issue affects users of IBM MQ Managed File Transfer (MFT) who have agents that:

- Connect to their agent queue manager using the CLIENT transport.
- Have been configured to use resource monitors.

Platforms affected:
MultiPlatform
****************************************************************
PROBLEM DESCRIPTION:
If an agent had connected to its agent queue manager using the CLIENT transport, and then lost connectivity to the queue manager for some reason (such as a network outage), all of the internal threads within the agent that were communicating with the queue manager at that time would start an internal "TriggerRecoveryThread". Each "TriggerRecoveryThread" would:

- Stop any managed transfers that were currently in progress.
- Stop any resource monitors within the agent.

In addition to this, the first "TriggerRecoveryThread" that was started would create another internal thread, called the "RecoveryThread". This thread would periodically try to reconnect to the agent queue manager. Once the "RecoveryThread" had successfully reconnected, it would restart all of the monitors associated with the agent, and try to resume all of the managed transfers that were stopped when the agent was disconnected from the agent queue manager.

The internal locking model within the agent meant that only one thread within an agent could start or stop resource monitors at any one time. As a result, if an agent lost connectivity to its agent queue manager for a very short period of time, then the following sequence of events could occur:

- All of the internal threads within the agent that were connected to the agent queue manager created a "TriggerRecoveryThread".
- "TriggerRecoveryThread-1" started. After stopping all of the managed transfers that were in progress, the thread obtained an internal lock and began stopping all of the resource monitors.
- The other "TriggerRecoveryThreads" ("TriggerRecoveryThread-2", "TriggerRecoveryThread-3", and so on) started up. There were no managed transfers to stop (as these had already been stopped by "TriggerRecoveryThread-1"), so the threads then tried to stop the resource monitors running within the agent. Because "TriggerRecoveryThread-1" had taken the internal lock, all of these threads became blocked.
- "TriggerRecoveryThread-1" finished stopping all of the resource monitors, and released the internal lock.
- "TriggerRecoveryThread-1" then started a "RecoveryThread", before stopping.
- "TriggerRecoveryThread-2" now got the internal lock, and tried to stop the resource monitors.
- While this processing was taking place, the "RecoveryThread" reconnected to the agent queue manager. It then became blocked waiting for the internal lock held by "TriggerRecoveryThread-2", which it needed in order to restart the resource monitors.
- As the monitors had already been stopped by "TriggerRecoveryThread-1", there was nothing for "TriggerRecoveryThread-2" to do, and so it released the internal lock.
- The "RecoveryThread" got the internal lock, and restarted all of the resource monitors. After all of the monitors had been restarted, the thread released the internal lock and started resuming any managed transfers that were stopped by "TriggerRecoveryThread-1".
- Next, "TriggerRecoveryThread-3" got the internal lock, and stopped the resource monitors. It then released the lock before stopping.
- All of the remaining "TriggerRecoveryThreads" did the same: each one stopped the resource monitors and then stopped itself.

After all of this processing had taken place, the agent had reconnected to the agent queue manager. However, all of the resource monitors associated with it were in a STOPPED state, which meant that they were no longer polling or submitting managed transfer requests to the agent.
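The sequence above can be replayed in a small Java model. This is only an illustrative sketch, not the agent's actual code: the class MonitorRaceReplay, its methods and the State enum are hypothetical names, and the interleaving is replayed on a single thread in the order described so that the outcome is deterministic. The lock stands in for the agent's internal lock.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the sequence described above; none of these
// names are taken from the MFT agent implementation.
class MonitorRaceReplay {
    enum State { STARTED, STOPPED }

    // Stands in for the internal lock that serialises monitor start/stop.
    private final ReentrantLock monitorLock = new ReentrantLock();
    private State monitors = State.STARTED;

    // What every "TriggerRecoveryThread" does to the resource monitors.
    void stopMonitors(String who) {
        monitorLock.lock();
        try {
            monitors = State.STOPPED;
            System.out.println(who + " stopped the monitors");
        } finally {
            monitorLock.unlock();
        }
    }

    // What the "RecoveryThread" does after reconnecting.
    void startMonitors(String who) {
        monitorLock.lock();
        try {
            monitors = State.STARTED;
            System.out.println(who + " restarted the monitors");
        } finally {
            monitorLock.unlock();
        }
    }

    // Replays the lock acquisition order from the problem description
    // and returns the final monitor state.
    static State replay(int triggerThreads) {
        MonitorRaceReplay agent = new MonitorRaceReplay();
        agent.stopMonitors("TriggerRecoveryThread-1");
        if (triggerThreads >= 2) {
            agent.stopMonitors("TriggerRecoveryThread-2"); // effectively a no-op
        }
        agent.startMonitors("RecoveryThread");
        // The late trigger threads get the lock after recovery has finished.
        for (int i = 3; i <= triggerThreads; i++) {
            agent.stopMonitors("TriggerRecoveryThread-" + i);
        }
        return agent.monitors;
    }

    public static void main(String[] args) {
        System.out.println("Final monitor state: " + replay(5));
    }
}
```

With three or more trigger threads in this model, the last thread to take the lock is always one that stops the monitors, which matches the STOPPED state observed after reconnection.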
Problem conclusion
To resolve this issue, IBM MQ Managed File Transfer agents have been updated so that only the first internal thread which detects that the agent queue manager is no longer available will create a "TriggerRecoveryThread". The "TriggerRecoveryThread" will start the "RecoveryThread" before exiting.

This ensures that once the "RecoveryThread" has restarted the resource monitors associated with an agent, they will remain in a STARTED state and will only be stopped by another "TriggerRecoveryThread" if the agent loses connectivity to its agent queue manager again.

---------------------------------------------------------------
The fix is targeted for delivery in the following PTFs:

Version    Maintenance Level
v9.0 LTS   9.0.0.7
v9.1 CD    9.1.3
v9.1 LTS   9.1.0.3

The latest available MQ maintenance can be obtained from 'WebSphere MQ Recommended Fixes'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037

If the maintenance level is not yet available, information on its planned availability can be found in 'WebSphere MQ Planned Maintenance Release Dates'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
---------------------------------------------------------------
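One common way to let only the first detecting thread start recovery is an atomic compare-and-set guard. The sketch below illustrates that general technique in Java. It is an assumption about how such a guard could look, not the change IBM shipped: the class SingleTriggerSketch and the methods onConnectionLost, onRecoveryComplete and simulate are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a "first detector wins" guard; not MFT code.
class SingleTriggerSketch {
    private final AtomicBoolean recoveryInProgress = new AtomicBoolean(false);
    private final AtomicInteger triggersCreated = new AtomicInteger();

    // Called by any internal thread that notices the connection is broken.
    // Returns true only for the one thread that wins the compare-and-set.
    boolean onConnectionLost() {
        if (!recoveryInProgress.compareAndSet(false, true)) {
            return false; // another thread has already started recovery
        }
        // A real agent would create its single trigger thread here.
        triggersCreated.incrementAndGet();
        return true;
    }

    // Re-arms the guard so that a later outage can trigger recovery again.
    void onRecoveryComplete() {
        recoveryInProgress.set(false);
    }

    // Lets several threads detect the outage at once and reports how
    // many of them were allowed to trigger recovery.
    static int simulate(int detectingThreads) {
        SingleTriggerSketch agent = new SingleTriggerSketch();
        ExecutorService pool = Executors.newFixedThreadPool(detectingThreads);
        CountDownLatch startGate = new CountDownLatch(1);
        for (int i = 0; i < detectingThreads; i++) {
            pool.execute(() -> {
                try {
                    startGate.await(); // release all detectors together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                agent.onConnectionLost();
            });
        }
        startGate.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("detector threads did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return agent.triggersCreated.get();
    }

    public static void main(String[] args) {
        System.out.println("Triggers created: " + simulate(8)); // prints 1
    }
}
```

Because only one thread ever stops the monitors per outage in this model, nothing is left queued to stop them again after they have been restarted.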
Temporary fix
Comments
APAR Information
APAR number
IT28193
Reported component name
IBM MQ BASE M/P
Reported component ID
5724H7261
Reported release
903
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2019-02-21
Closed date
2019-04-23
Last modified date
2019-06-17
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
IBM MQ MFT V9.0
Fixed component ID
5724H7262
Applicable component levels
Document Information
Modified date:
17 June 2019