IBM Support

IT37970: RDQM queue managers might fail to start after network outage between nodes

APAR status

  • Closed as program error.

Error description

  • After a network outage between replicated data queue manager
    (RDQM) nodes, RDQM queue managers might fail to start because
    of a known issue in the underlying DRBD kernel module.
    
    This issue can leave RDQM resources stuck in an invalid state
    during resync, which prevents the queue manager from starting.
    Both nodes are then listed as "Outdated", as shown below:
    
    QMNAME(QMGR) STATUS(Status not available) DEFAULT(no)
    STANDBY(Not applicable) INSTNAME(Installation1)
    INSTPATH(/opt/mqm) INSTVER(9.2.0.1) HA(Replicated)
    DRROLE(Primary)
    
    qmgr role:Secondary
      disk:Outdated
      node1 role:Secondary
        peer-disk:Outdated
      node2 connection:Connecting
    
    Not every queue manager fails to start after a network outage
    between RDQM nodes, but some can get into this state.
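
The symptom above can be recognised mechanically. The following is a minimal Python sketch, not an IBM-supplied tool, that scans status text in the shape shown above and reports whether both the local disk and a peer disk are Outdated. The function name `both_sides_outdated` and the sample text are illustrative assumptions; how the status text is captured on a node is left to the reader.

```python
import re

def both_sides_outdated(status_text: str) -> bool:
    """Return True when the local disk and at least one peer disk are Outdated.

    Hypothetical helper: it only understands the 'disk:' and 'peer-disk:'
    fields in the layout shown in this APAR.
    """
    # The local disk state is on a line that starts with 'disk:'.
    local = re.search(r"^[ \t]*disk:(\S+)", status_text, re.MULTILINE)
    # Each peer reports its disk state as 'peer-disk:<state>'.
    peers = re.findall(r"peer-disk:(\S+)", status_text)
    return local is not None and local.group(1) == "Outdated" and "Outdated" in peers

# Sample text copied from the error description above.
SAMPLE = """qmgr role:Secondary
  disk:Outdated
  node1 role:Secondary
    peer-disk:Outdated
  node2 connection:Connecting
"""

if __name__ == "__main__":
    print(both_sides_outdated(SAMPLE))
```

A peer that is still in the Connecting state reports no peer-disk field, so it is ignored rather than counted as Outdated.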
    

Local fix

Problem summary

  • ****************************************************************
    USERS AFFECTED:
    This problem can affect all RDQM users on Red Hat Enterprise
    Linux x86-64 with a kmod-drbd RPM version earlier than 9.0.28.
    
    
    Platforms affected:
    Linux on x86-64
    
    ****************************************************************
    PROBLEM DESCRIPTION:
    DRBD kernel module (kmod-drbd) versions before 9.0.28 have a
    defect which can prevent resources from resyncing after a
    network outage.
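
The version threshold above can be checked with a numeric comparison rather than a string comparison (as strings, "9.0.9" would sort after "9.0.28"). The following Python sketch is illustrative only: the function names are hypothetical, and it assumes the installed kmod-drbd version string begins with a dotted three-part DRBD version, optionally followed by a kernel-specific suffix.

```python
import re

# First kmod-drbd level that contains the DRBD fix, per this APAR.
FIXED_DRBD = (9, 0, 28)

def drbd_version_tuple(version: str) -> tuple:
    """Extract the leading x.y.z DRBD version as a tuple of integers."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised kmod-drbd version: {version!r}")
    return tuple(int(part) for part in match.groups())

def is_affected(version: str) -> bool:
    """Return True when the version is earlier than 9.0.28."""
    return drbd_version_tuple(version) < FIXED_DRBD

if __name__ == "__main__":
    # Example version strings are assumptions, not taken from the APAR.
    for candidate in ("9.0.23_3.10.0_1062", "9.0.28", "9.0.9"):
        print(candidate, is_affected(candidate))
```

Comparing integer tuples keeps the ordering correct when a component has more than one digit.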
    

Problem conclusion

  • This issue is caused by a defect in the DRBD kernel module,
    which is a prerequisite for RDQM. The DRBD fix is described as
    "fix resync decision to obey disk states when the generation
    UUIDs are equal; the effect of this bug was that you could end
    up with two Outdated nodes after resync":
    https://lists.linbit.com/pipermail/drbd-user/2021-February/025846.html
    
    When this DRBD problem occurs, RDQM resources will be left in
    an Outdated state which cannot be recovered through normal MQ
    commands.
    
    All MQ 9.2.0 LTS RDQM users should update to MQ 9.2.0.2 or a
    later fix pack.  All MQ 9.2.1 CD RDQM users should update to
    MQ 9.2.2 CD or later.  These levels include kmod-drbd-9.0.28
    or later, per APAR IT35808:
    https://www.ibm.com/support/pages/apar/IT35808
    
    All MQ 9.1.0 LTS RDQM users should update to MQ 9.1.0.8 or a
    later fix pack.  These levels include kmod-drbd-9.0.28 or later,
    per APAR IT36295:
    https://www.ibm.com/support/pages/apar/IT36295
    
    ---------------------------------------------------------------
    The fix is targeted for delivery in the following PTFs:
    
    Version    Maintenance Level
    v9.1 LTS   9.1.0.8
    v9.2 LTS   9.2.0.2
    v9.x CD    9.2.2
    
    The latest available maintenance can be obtained from
    'WebSphere MQ Recommended Fixes'
    http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037
    
    If the maintenance level is not yet available, information on
    its planned availability can be found in 'WebSphere MQ
    Planned Maintenance Release Dates'
    http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
    ---------------------------------------------------------------
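
The maintenance levels listed above can be encoded as a small lookup. The following Python sketch is an illustration under stated assumptions, not an IBM utility: the function name `includes_fix` is hypothetical, the input is assumed to be a dotted V.R.M.F string such as the INSTVER value shown in the error description, and any release stream this APAR does not mention returns None instead of a guess.

```python
from typing import Optional

def includes_fix(vrmf: str) -> Optional[bool]:
    """Report whether an MQ level includes kmod-drbd 9.0.28 or later.

    Returns True or False for the streams named in this APAR
    (9.1.0 LTS, 9.2.0 LTS, 9.2.x CD) and None for anything else.
    """
    parts = tuple(int(p) for p in vrmf.split("."))
    parts += (0,) * (4 - len(parts))      # pad "9.2.2" to 9.2.2.0
    v, r, m, f = parts[:4]
    if (v, r, m) == (9, 1, 0):            # 9.1.0 LTS: fixed in 9.1.0.8
        return f >= 8
    if (v, r, m) == (9, 2, 0):            # 9.2.0 LTS: fixed in 9.2.0.2
        return f >= 2
    if (v, r) == (9, 2):                  # 9.2.x CD: fixed in 9.2.2
        return m >= 2
    return None                           # stream not covered by this APAR

if __name__ == "__main__":
    for level in ("9.2.0.1", "9.2.0.2", "9.1.0.8", "9.2.1", "9.2.2"):
        print(level, includes_fix(level))
```

Returning None for unlisted streams is a deliberate choice: the APAR names only the levels above, so the sketch does not infer the status of other releases.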
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT37970

  • Reported component name

    MQ BASE V9.2

  • Reported component ID

    5724H7281

  • Reported release

    920

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    YesHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2021-08-11

  • Closed date

    2021-11-02

  • Last modified date

    2022-03-09

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    MQ BASE V9.2

  • Fixed component ID

    5724H7281

Applicable component levels


Document Information

Modified date:
10 March 2022