IT40853: RDQM promote and demote cluster resource timeout values are too low

APAR status

Closed as program error.

Error description

In an IBM MQ replicated data queue manager (RDQM) managed
failover, some cluster resources may need more time than is
permitted to complete their failover tasks. When a queue manager
fails over from node01 to node02, node01 must be demoted before
node02 can be promoted. On a busy system, demoting a node may
take longer than the allotted time, causing cluster resources to
fail and the queue manager to end.

If this occurs, the /var/log/messages file contains entries
similar to the following:

Nov 28 16:13:32 node01 lrmd[4639]:  warning: p_drbd_qm1_demote_0
process (PID 66762) timed out
Nov 28 16:13:32 node01 kernel: [2812690.170750] drbd qm1: State
change failed: In transient state, retry after next state change
Nov 28 16:13:32 node01 kernel: [2812690.171950] drbd qm1:
Failed: role( Primary -> Secondary )
Nov 28 16:13:32 node01 kernel: [2812690.171985] drbd qm1: State
change failed: In transient state, retry after next state change
Nov 28 16:13:32 node01 kernel: [2812690.172629] drbd qm1:
Failed: role( Primary -> Secondary )
Nov 28 16:13:32 node01 lrmd[4639]:  warning:
p_drbd_qm1_demote_0:66762 - timed out after 20000ms
Nov 28 16:13:32 node01 crmd[4642]:    error: Result of demote
operation for p_drbd_qm1 on node01: Timed Out
...
Nov 28 16:15:14 node01 kernel: [2812792.101911] drbd qm1: State
change failed: In transient state, retry after next state change
Nov 28 16:15:14 node01 kernel: [2812792.102713] drbd qm1:
Failed: role( Primary -> Secondary )
Nov 28 16:15:14 node01 kernel: [2812792.102758] drbd qm1: State
change failed: In transient state, retry after next state change
Nov 28 16:15:14 node01 kernel: [2812792.103299] drbd qm1:
Failed: role( Primary -> Secondary )
Nov 28 16:15:14 node01 lrmd[4639]:  warning: p_drbd_qm1_stop_0
process (PID 68894) timed out
Nov 28 16:15:14 node01 lrmd[4639]:  warning:
p_drbd_qm1_stop_0:68894 - timed out after 100000ms
Nov 28 16:15:14 node01 crmd[4642]:    error: Result of stop
operation for p_drbd_qm1 on node01: Timed Out
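
One way to check for this symptom is to scan the log for the lrmd
timeout warnings. The sketch below is an illustration only and is
not part of the APAR: the function name rdqm_timed_out_ops is
hypothetical, and it assumes that each entry occupies a single line
in the log file (the entries above are wrapped for display) and
follows the "<resource>_<operation>_<interval>:<PID> - timed out
after <N>ms" pattern seen in the snippet.

```shell
#!/bin/sh
# Print "<resource> <operation> <timeout_ms>" for every cluster resource
# operation that lrmd reports as timed out in the given log file.
# rdqm_timed_out_ops is a hypothetical helper name, not an MQ command.
rdqm_timed_out_ops() {
    logfile="${1:-/var/log/messages}"
    sed -nE 's/.* ([A-Za-z0-9_]+)_(promote|demote|start|stop|monitor)_[0-9]+:[0-9]+ - timed out after ([0-9]+)ms.*/\1 \2 \3/p' "$logfile"
}
```

For example, "rdqm_timed_out_ops /var/log/messages | sort -u" lists
each affected resource and operation once.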

Local fix

To get the queue manager running again, clean up its cluster
resources.
The command to clean up all resources is: crm_resource --cleanup
The command to clean up a named resource is: crm resource
cleanup <resource name>

For example, to correct the failed cluster resource in the
snippet above, run: crm resource cleanup p_drbd_qm1
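
When several resources have failed, the named cleanup can be
repeated for each one. The following is a minimal sketch, not an
IBM-supplied tool: rdqm_cleanup_cmds is a hypothetical helper that
only prints the "crm resource cleanup" command for each resource
name passed to it, so that the commands can be reviewed before they
are run.

```shell
#!/bin/sh
# Print one cleanup command per resource name given as an argument.
# Nothing is executed here: review the output, then run the commands
# by hand. rdqm_cleanup_cmds is a hypothetical helper name.
rdqm_cleanup_cmds() {
    for res in "$@"; do
        case "$res" in
            # Reject empty names and names with unexpected characters.
            ''|*[!A-Za-z0-9_.-]*)
                echo "skipping invalid resource name: $res" >&2 ;;
            *)
                printf 'crm resource cleanup %s\n' "$res" ;;
        esac
    done
}
```

For example, "rdqm_cleanup_cmds p_drbd_qm1" prints the same command
as the example above.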

Problem summary

****************************************************************
USERS AFFECTED:
All RDQM users with a busy system who wish to perform a managed
failover.


Platforms affected:
Linux on x86-64

****************************************************************
PROBLEM DESCRIPTION:
The default promote and demote timeout intervals are as low as
20 seconds, which may not be sufficient to complete promote or
demote tasks on a busy system.

Problem conclusion

The code has been modified to increase the promote and demote
timeout intervals to ensure MQ has sufficient time to complete
the tasks.

---------------------------------------------------------------
The fix is targeted for delivery in the following PTFs:

Version    Maintenance Level
v9.2 LTS   9.2.0.7
v9.x CD    9.2.2

The latest available maintenance can be obtained from
'WebSphere MQ Recommended Fixes'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037

If the maintenance level is not yet available, information on
its planned availability can be found in 'WebSphere MQ
Planned Maintenance Release Dates'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
---------------------------------------------------------------
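
To judge whether an installed LTS level already includes the fix,
its version can be compared with 9.2.0.7. The sketch below is an
illustration under stated assumptions: version_at_least is a
hypothetical helper, it relies on the -V (version sort) option of
GNU sort, and the commented-out usage assumes that
"dspmqver -b -f 2" prints only the version string. The comparison
is meaningful for LTS levels only, because CD releases follow a
separate numbering stream.

```shell
#!/bin/sh
# Succeed if version $1 is at or above version $2.
# Uses GNU sort -V to order the two dotted version strings; the
# smaller of the two must be $2 for the check to pass.
# version_at_least is a hypothetical helper name.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (assumption: dspmqver -b -f 2 prints only the version):
# installed=$(dspmqver -b -f 2)
# if version_at_least "$installed" 9.2.0.7; then
#     echo "LTS level includes the fix"
# fi
```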

Temporary fix

Comments

APAR Information

APAR number
IT40853
Reported component name
MQ BASE V9.2
Reported component ID
5724H7281
Reported release
920
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2022-05-05
Closed date
2022-12-01
Last modified date
2022-12-01

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name
MQ BASE V9.2
Fixed component ID
5724H7281

Applicable component levels


Document Information

Modified date:
02 December 2022
