IBM Support

IT40853: RDQM promote and demote cluster resource timeout values are too low


APAR status

  • Closed as program error.

Error description

  • In an IBM MQ replicated data queue manager (RDQM) managed
    failover scenario, some cluster resources may require more time
    than is permitted to perform their failover tasks. If a queue
    manager is failing over from node01 to node02, node01 must be
    demoted before node02 can be promoted. On a busy system, demoting
    a node may take longer than the allotted time, causing cluster
    resources to fail and the queue manager to end.
    
    If this were to occur, the /var/log/messages file would contain
    entries similar to the snippet below:
    
    Nov 28 16:13:32 node01 lrmd[4639]:  warning: p_drbd_qm1_demote_0
    process (PID 66762) timed out
    Nov 28 16:13:32 node01 kernel: [2812690.170750] drbd qm1: State
    change failed: In transient state, retry after next state change
    Nov 28 16:13:32 node01 kernel: [2812690.171950] drbd qm1:
    Failed: role( Primary -> Secondary )
    Nov 28 16:13:32 node01 kernel: [2812690.171985] drbd qm1: State
    change failed: In transient state, retry after next state change
    Nov 28 16:13:32 node01 kernel: [2812690.172629] drbd qm1:
    Failed: role( Primary -> Secondary )
    Nov 28 16:13:32 node01 lrmd[4639]:  warning:
    p_drbd_qm1_demote_0:66762 - timed out after 20000ms
    Nov 28 16:13:32 node01 crmd[4642]:    error: Result of demote
    operation for p_drbd_qm1 on node01: Timed Out
    ...
    Nov 28 16:15:14 node01 kernel: [2812792.101911] drbd qm1: State
    change failed: In transient state, retry after next state change
    Nov 28 16:15:14 node01 kernel: [2812792.102713] drbd qm1:
    Failed: role( Primary -> Secondary )
    Nov 28 16:15:14 node01 kernel: [2812792.102758] drbd qm1: State
    change failed: In transient state, retry after next state change
    Nov 28 16:15:14 node01 kernel: [2812792.103299] drbd qm1:
    Failed: role( Primary -> Secondary )
    Nov 28 16:15:14 node01 lrmd[4639]:  warning: p_drbd_qm1_stop_0
    process (PID 68894) timed out
    Nov 28 16:15:14 node01 lrmd[4639]:  warning:
    p_drbd_qm1_stop_0:68894 - timed out after 100000ms
    Nov 28 16:15:14 node01 crmd[4642]:    error: Result of stop
    operation for p_drbd_qm1 on node01: Timed Out
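
The timed-out operations in a log like the one above can be picked out programmatically. The following is a minimal sketch, not an IBM-supplied tool: it assumes the lrmd message format shown in the snippet ("<resource>_<operation>_<n>:<pid> - timed out after <N>ms"), and the names TIMEOUT_RE, find_timeouts and SAMPLE are hypothetical, introduced only for illustration.

```python
import re

# Matches lrmd messages of the form seen in the snippet, e.g.
#   "p_drbd_qm1_demote_0:66762 - timed out after 20000ms"
# The operation list is an assumption based on common cluster resource operations.
TIMEOUT_RE = re.compile(
    r"(?P<resource>\w+?)_(?P<op>promote|demote|start|stop)_\d+:(?P<pid>\d+)"
    r" - timed out after (?P<ms>\d+)ms"
)


def find_timeouts(log_text):
    """Return (resource, operation, timeout_seconds) for each timed-out operation."""
    # The APAR text wraps long syslog lines, so collapse all whitespace
    # to re-join a message that was split across two lines.
    joined = " ".join(log_text.split())
    return [
        (m["resource"], m["op"], int(m["ms"]) / 1000)
        for m in TIMEOUT_RE.finditer(joined)
    ]


# Two of the messages from the snippet above, wrapped as they appear there.
SAMPLE = """
Nov 28 16:13:32 node01 lrmd[4639]:  warning:
p_drbd_qm1_demote_0:66762 - timed out after 20000ms
Nov 28 16:15:14 node01 lrmd[4639]:  warning:
p_drbd_qm1_stop_0:68894 - timed out after 100000ms
"""

for resource, op, seconds in find_timeouts(SAMPLE):
    print(f"{resource}: {op} timed out after {seconds:g} s")
```

On a live system the same function could be given the contents of /var/log/messages instead of SAMPLE; the resource names it reports are the ones that need cleaning up.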
    

Local fix

  • To get the queue manager running again, its cluster resources
    need to be cleaned up.
    The command to clean up all resources is: crm_resource --cleanup
    The command to clean up a named resource is:
    crm resource cleanup <resource_name>
    
    For example, to correct the failed cluster resource in the
    snippet above, run: crm resource cleanup p_drbd_qm1
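
Where several resources have failed, the cleanup commands quoted above can be generated from a list of resource names. This is a hedged sketch only: the helper names cleanup_command and cleanup are hypothetical, the command strings are taken verbatim from this local fix, and the default is a dry run that prints the commands without executing them.

```python
import subprocess


def cleanup_command(resource=None):
    """Build the argument list for the cleanup commands quoted in the local fix."""
    if resource is None:
        # Clean up all resources.
        return ["crm_resource", "--cleanup"]
    # Clean up one named resource.
    return ["crm", "resource", "cleanup", resource]


def cleanup(resources, dry_run=True):
    """Print, and optionally run, one cleanup command per distinct resource."""
    commands = [cleanup_command(name) for name in sorted(set(resources))]
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # Raises CalledProcessError if the command exits with a non-zero status.
            subprocess.run(argv, check=True)
    return commands


# Dry run for the failed resource from the log snippet.
cleanup(["p_drbd_qm1"])
```

Passing dry_run=False would execute the commands, which assumes the script runs on an RDQM node with sufficient privileges.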
    

Problem summary

  • ****************************************************************
    USERS AFFECTED:
    All RDQM users with a busy system who wish to perform a managed
    failover.
    
    
    Platforms affected:
    Linux on x86-64
    
    ****************************************************************
    PROBLEM DESCRIPTION:
    The default promote and demote timeout intervals are as low as
    20 seconds, which may not be sufficient to complete promote or
    demote tasks on a busy system.
    

Problem conclusion

  • The code has been modified to increase the promote and demote
    timeout intervals to ensure MQ has sufficient time to complete
    the tasks.
    
    ---------------------------------------------------------------
    The fix is targeted for delivery in the following PTFs:
    
    Version    Maintenance Level
    v9.2 LTS   9.2.0.7
    v9.x CD    9.2.2
    
    The latest available maintenance can be obtained from
    'WebSphere MQ Recommended Fixes'
    http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037
    
    If the maintenance level is not yet available, information on
    its planned availability can be found in 'WebSphere MQ
    Planned Maintenance Release Dates'
    http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
    ---------------------------------------------------------------
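
To check an installed Long Term Support level against the table above, a version comparison along these lines may help. It is a sketch under stated assumptions: the version string is assumed to be in the four-part V.R.M.F form (for example, as reported by dspmqver), only the 9.2.0.x LTS row is evaluated, and the names LTS_FIX_LEVEL, parse_version and lts_has_fix are hypothetical.

```python
# LTS maintenance level that delivers the fix, taken from the table above.
LTS_FIX_LEVEL = (9, 2, 0, 7)


def parse_version(text):
    """Turn a dotted version string such as '9.2.0.5' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def lts_has_fix(version_text):
    """Return True or False for 9.2.0.x LTS levels, None for any other stream."""
    version = parse_version(version_text)
    if len(version) != 4 or version[:3] != LTS_FIX_LEVEL[:3]:
        # Not a 9.2.0.x level: this sketch makes no claim about it.
        return None
    return version >= LTS_FIX_LEVEL


for level in ("9.2.0.5", "9.2.0.7", "9.2.0.10"):
    print(level, lts_has_fix(level))
```

Comparing tuples of integers rather than strings keeps 9.2.0.10 ordered after 9.2.0.7.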
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT40853

  • Reported component name

    MQ BASE V9.2

  • Reported component ID

    5724H7281

  • Reported release

    920

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2022-05-05

  • Closed date

    2022-12-01

  • Last modified date

    2022-12-01

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    MQ BASE V9.2

  • Fixed component ID

    5724H7281

Applicable component levels


Document Information

Modified date:
02 December 2022