IBM Support

IJ25511: ALL QUORUM NODES COULD BE EXPELLED.

APAR status

  • Closed as program error.

Error description

  • Under rare circumstances, all quorum nodes can be expelled
    after the current cluster manager is expelled because of a
    network-level error on the cluster manager node. The result
    is a cluster-wide quorum loss. This applies only when RDMA
    is activated and all GPFS RPCs go over RDMA (verbsPorts,
    verbsRdma, and verbsRdmaSend must be set).
    The current cluster manager is expelled because of the
    network error (as expected), but the newly elected cluster
    manager cannot make progress in its subsequent group
    protocol: it waits for a 10-second linger timeout in the
    CCR when a cached socket connection to the former cluster
    manager is closed. As a result, all quorum nodes are
    expelled.
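
The exposure condition above (all three RDMA attributes set) can be checked mechanically. The following is a minimal sketch, not an IBM tool: it assumes the configuration is available as plain "attribute value" lines, similar to what mmlsconfig prints, and it assumes that "enable" and "yes" are the values that switch verbsRdma and verbsRdmaSend on. The function name and the sample text are hypothetical.

```python
# Hypothetical helper: decide whether a configuration dump matches the
# RDMA setup described in this APAR (verbsPorts, verbsRdma, verbsRdmaSend set).
# Assumption: input is "attribute value" text, one attribute per line.

ENABLED_VALUES = {"enable", "yes"}  # assumed spellings for "switched on"


def matches_affected_config(config_text: str) -> bool:
    """Return True when all three RDMA attributes named in the APAR are set."""
    settings = {}
    for line in config_text.splitlines():
        parts = line.split(None, 1)  # attribute name, then the rest of the line
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()

    ports_set = bool(settings.get("verbsPorts"))
    rdma_on = settings.get("verbsRdma", "").lower() in ENABLED_VALUES
    rdma_send_on = settings.get("verbsRdmaSend", "").lower() in ENABLED_VALUES
    return ports_set and rdma_on and rdma_send_on


# Illustrative input only; the port name is made up.
sample = """verbsPorts mlx5_0/1
verbsRdma enable
verbsRdmaSend yes
"""
print(matches_affected_config(sample))
```

A cluster for which this returns False does not meet the precondition stated in the error description, because at least one of the three attributes is missing or switched off.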
    

Local fix

Problem summary

  • Under rare circumstances, all quorum nodes can be expelled
    after the current cluster manager is expelled because of a
    network-level error on the cluster manager node. The result
    is a cluster-wide quorum loss. This applies only when RDMA
    is activated and all GPFS RPCs go over RDMA (verbsPorts,
    verbsRdma, and verbsRdmaSend must be set).
    The current cluster manager is expelled because of the
    network error (as expected), but the newly elected cluster
    manager cannot make progress in its subsequent group
    protocol: it waits for a 10-second linger timeout in the
    CCR when a cached socket connection to the former cluster
    manager is closed. As a result, all quorum nodes are
    expelled.
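
To illustrate the kind of delay described above, here is a minimal sketch of a socket configured with a 10-second linger timeout. It assumes that the "linger timeout" refers to the standard SO_LINGER socket option, which the APAR does not state explicitly, and it does not reproduce the CCR code. The helper names are hypothetical, and the two-int layout of struct linger is the Linux one.

```python
import socket
import struct


def make_lingering_socket(seconds: int) -> socket.socket:
    """Create a TCP socket whose close() may block up to `seconds`.

    With SO_LINGER switched on, closing a connected socket that still has
    unsent data waits until the data is delivered or the timeout expires.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # struct linger { int l_onoff; int l_linger; } on Linux
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, seconds))
    return sock


def read_linger(sock: socket.socket) -> tuple:
    """Return (l_onoff, l_linger) as currently set on the socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize("ii"))
    return struct.unpack("ii", raw)


sock = make_lingering_socket(10)  # 10 seconds, the value named in the APAR
onoff, timeout = read_linger(sock)
print(onoff != 0, timeout)
sock.close()  # returns immediately here: the socket was never connected
```

If the peer is unreachable, as the expelled cluster manager would be, a close on such a socket can take the full timeout, which is consistent with the stall the problem summary describes.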
    

Problem conclusion

  • Benefits of the solution:
    The remaining quorum nodes stay active and start a new
    cluster manager election when the current cluster manager
    has been expelled because of a network error while RDMA is
    used for all GPFS RPCs. The newly elected cluster manager
    can finish its group protocol in time.
    Workaround:
    Not available.
    Problem trigger:
    Error injection at the network level (daemon IP address)
    on the current cluster manager node when RDMA is used for
    all GPFS RPCs.
    Symptom:
    Node expel/Lost Membership/Quorum loss.
    Platforms affected:
    x86_64-linux at least
    Functional areas affected:
    - RDMA
    - Cluster Membership
    - Cluster Manager
    Customer impact:
    High Importance
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ25511

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    505

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2020-06-15

  • Closed date

    2020-06-15

  • Last modified date

    2020-06-15

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    IJ26907

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels

  • Product: IBM Spectrum Scale (STXKQY)

  • Version: 505

  • Platform: Platform Independent (PF025)

  • Business Unit: IBM Infrastructure w/TPS (BU058)

  • Line of Business: Storage (LOB26)

Document Information

Modified date:
12 August 2020