IT35878: Multiple instances of a highly available MFT agent start as the active instance following a queue manager restart

APAR status

Closed as program error.

Error description

Shortly after two instances of a highly available agent start up
and connect to the agent queue manager, the agent queue manager
is stopped. When the queue manager is restarted, one instance of
the agent reconnects to the queue manager and successfully starts
as the active instance. The other instance reconnects to the
queue manager, and writes the following message to its event log
(output0.log) at regular intervals:

BFGMQ1045I: Agent's system queue
'SYSTEM.FTE.COMMAND.<agent_name>' is configured as either
NOSHARE or DEFSOPT(EXCL).
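
One way to check whether an instance is reporting this symptom is to
search its event log for the message ID. The following is a minimal
sketch only, not an IBM-supplied tool. It assumes the message text
appears in output0.log as shown above (possibly wrapped across lines)
and ignores anything else on the line. The function name and the
agent name AGENT1 in the sample text are hypothetical.

```python
import re

# Matches the message ID and captures the quoted queue name.
# \s+ allows the message to be wrapped across several lines.
PATTERN = re.compile(r"BFGMQ1045I:\s+Agent's\s+system\s+queue\s+'([^']+)'")


def queues_reported_exclusive(log_text):
    """Return the distinct queue names named in BFGMQ1045I messages."""
    seen = []
    for name in PATTERN.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Sample text built from the message wording above; AGENT1 is a
# made-up agent name used purely for illustration.
sample = (
    "BFGMQ1045I: Agent's system queue\n"
    "'SYSTEM.FTE.COMMAND.AGENT1' is configured as either\n"
    "NOSHARE or DEFSOPT(EXCL).\n"
    "BFGMQ1045I: Agent's system queue\n"
    "'SYSTEM.FTE.COMMAND.AGENT1' is configured as either\n"
    "NOSHARE or DEFSOPT(EXCL).\n"
)

print(queues_reported_exclusive(sample))
```

To run it against a real log, read the contents of output0.log into a
string and pass that string to the function.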

Local fix

Problem summary

****************************************************************
USERS AFFECTED:
This issue affects all users of MQ Managed File Transfer highly
available (HA) agents.


Platforms affected:
MultiPlatform

****************************************************************
PROBLEM DESCRIPTION:
A highly available Managed File Transfer agent consists of:

- One active instance.
- One or more standby instances.

The first instance of the agent that starts up locks a shared
resource (the SYSTEM.FTE.HA.agent_name queue on the agent queue
manager). When the other instances start, they fail to obtain
the lock and become standby instances. The standby instances
then attempt to take the lock at regular intervals, as
specified by the agent property standbyPollInterval. Once a
standby instance obtains the lock, it becomes the active
instance.
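
The election described above can be illustrated with a small model.
This is a sketch under stated assumptions, not the agent's actual
implementation: the SharedResource class is an in-memory stand-in for
the lock on the SYSTEM.FTE.HA.agent_name queue, and all class and
method names are hypothetical.

```python
class SharedResource:
    """In-memory stand-in for the lock on the HA queue (hypothetical)."""

    def __init__(self):
        self.holder = None

    def try_lock(self, instance_name):
        # Only one instance may hold the lock at a time.
        if self.holder is None:
            self.holder = instance_name
            return True
        return False

    def unlock(self, instance_name):
        if self.holder == instance_name:
            self.holder = None


class AgentInstance:
    def __init__(self, name, resource):
        self.name = name
        self.resource = resource
        self.role = "stopped"

    def start(self):
        # The first instance to take the lock is active; others stand by.
        locked = self.resource.try_lock(self.name)
        self.role = "active" if locked else "standby"

    def poll(self):
        # Called by a standby instance once per standbyPollInterval.
        if self.role == "standby" and self.resource.try_lock(self.name):
            self.role = "active"

    def stop(self):
        self.resource.unlock(self.name)
        self.role = "stopped"


resource = SharedResource()
first = AgentInstance("instance1", resource)
second = AgentInstance("instance2", resource)

first.start()
second.start()
print(first.role, second.role)  # active standby

second.poll()   # lock still held by instance1, no change
first.stop()    # active instance releases the lock
second.poll()   # standby obtains the lock and takes over
print(first.role, second.role)  # stopped active
```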

After the active instance has locked the shared resource, it
performs its normal startup operations and then starts
processing managed transfers.

Before this fix, if the active instance became disconnected from
the agent queue manager after it had obtained the lock on the
shared resource and before it had completed its initialization,
it would attempt to reconnect to the agent queue manager at
regular intervals. However, after it had successfully
reconnected, it did not attempt to relock the shared resource.
Instead, it would just continue with its initialization
processing.

This meant that a standby instance could lock the shared
resource on the agent queue manager and so become an active
instance too. If this happened, there would now be two active
instances of the agent that were trying to initialize at the
same time. The two instances would each attempt to open various
system queues on the agent queue manager for exclusive access.
One of the instances would be able to access the system queues
and successfully complete its initialization. The other would
fail to do so, and would write messages similar to the ones
shown below to its event log (output0.log):

BFGMQ1045I: Agent's system queue 'SYSTEM.FTE.COMMAND.agent_name'
is configured as either NOSHARE or DEFSOPT(EXCL).
BFGMQ1045I: Agent's system queue 'SYSTEM.FTE.EVENT.agent_name'
is configured as either NOSHARE or DEFSOPT(EXCL).

These messages would be written to the event log at regular
intervals until the first instance was stopped, at which point
the other instance would be able to open the system queues and
initialize successfully.
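
The sequence of events that leads to two active instances can be
reproduced in a toy model. This is an illustrative sketch only. It
assumes that a queue manager restart leaves the lock on the shared
resource unheld, and the QueueManager class, the instance names and
the function name are all hypothetical.

```python
class QueueManager:
    """Toy agent queue manager holding the lock on the shared resource."""

    def __init__(self):
        self.lock_holder = None

    def try_lock(self, instance_name):
        if self.lock_holder is None:
            self.lock_holder = instance_name
            return True
        return False

    def restart(self):
        # Assumption of this model: a restart drops every connection,
        # so the lock is no longer held by anyone.
        self.lock_holder = None


def simulate_without_fix():
    qmgr = QueueManager()
    roles = {}
    roles["instance1"] = "active" if qmgr.try_lock("instance1") else "standby"
    roles["instance2"] = "active" if qmgr.try_lock("instance2") else "standby"

    qmgr.restart()  # instance1 is still initializing at this point

    # Pre-fix behaviour: instance1 reconnects and continues initializing
    # without relocking, so its role is left unchanged.

    # instance2 polls, finds the lock free, and promotes itself.
    if roles["instance2"] == "standby" and qmgr.try_lock("instance2"):
        roles["instance2"] = "active"
    return roles, qmgr.lock_holder


roles, holder = simulate_without_fix()
print(roles)   # {'instance1': 'active', 'instance2': 'active'}
print(holder)  # instance2
```

In this model both instances end up believing they are active, even
though only instance2 actually holds the lock.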

Problem conclusion

To resolve this issue, IBM MQ Managed File Transfer highly
available agents have been updated so that if an active instance
becomes disconnected from its agent queue manager:

- After it has obtained the lock on the shared resource.
- And before it has completed its initialization.

then it will attempt to relock the shared resource after it has
reconnected. If the instance is able to relock the shared
resource, then it will remain as the active instance of the
agent, and will continue with its initialization. However, if
the instance fails to lock the shared resource after it has
reconnected, then it will become a standby instance. This
prevents two instances of the agent from being the active
instance.
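
The updated behaviour can be sketched with a toy model in which the
reconnecting instance tries to relock before it carries on. This is
a minimal illustration under stated assumptions, not the product
code: a restart is assumed to leave the lock unheld, and the
QueueManager class and all function names are hypothetical.

```python
class QueueManager:
    """Toy agent queue manager holding the lock on the shared resource."""

    def __init__(self):
        self.lock_holder = None

    def try_lock(self, instance_name):
        if self.lock_holder is None:
            self.lock_holder = instance_name
            return True
        return False

    def restart(self):
        # Assumption of this model: a restart leaves the lock unheld.
        self.lock_holder = None


def standby_poll(roles, name, qmgr):
    # A standby instance becomes active only if it obtains the lock.
    if roles[name] == "standby" and qmgr.try_lock(name):
        roles[name] = "active"


def reconnect_with_fix(roles, name, qmgr):
    # Relock after reconnecting; fall back to standby on failure.
    roles[name] = "active" if qmgr.try_lock(name) else "standby"


def simulate_with_fix(standby_polls_first):
    qmgr = QueueManager()
    roles = {"instance1": "standby", "instance2": "standby"}
    standby_poll(roles, "instance1", qmgr)  # starts first, takes the lock
    standby_poll(roles, "instance2", qmgr)  # lock is held, stays standby

    qmgr.restart()  # instance1 has not finished initializing

    if standby_polls_first:
        standby_poll(roles, "instance2", qmgr)
        reconnect_with_fix(roles, "instance1", qmgr)
    else:
        reconnect_with_fix(roles, "instance1", qmgr)
        standby_poll(roles, "instance2", qmgr)
    return roles


for order in (True, False):
    result = simulate_with_fix(order)
    print(order, result, list(result.values()).count("active"))
```

Whichever instance reaches the lock first after the restart, the
model ends with exactly one active instance.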

---------------------------------------------------------------
The fix is targeted for delivery in the following PTFs:

Version    Maintenance Level
v9.2 LTS   9.2.0.3
v9.x CD    9.2.3

The latest available maintenance can be obtained from
'WebSphere MQ Recommended Fixes'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037

If the maintenance level is not yet available, information on
its planned availability can be found in 'WebSphere MQ
Planned Maintenance Release Dates'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
---------------------------------------------------------------

Temporary fix

Comments

APAR Information

APAR number
IT35878
Reported component name
MQ BASE V9.2
Reported component ID
5724H7281
Reported release
920
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2021-02-11
Closed date
2021-03-25
Last modified date
2021-03-25

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name
MQ BASE V9.2
Fixed component ID
5724H7281

Applicable component levels


Document Information

Modified date:
26 March 2021
