IBM Support

IJ43164: FOR A CLUSTER WHICH HAS THE VERBSRDMASEND CONFIGURATION ENABLED, AN RPC MESSAGE COULD HANG AND POSSIBLY RESULT IN A DEADLOCK.


APAR status

  • Closed as program error.

Error description

  • If the verbsRdmaSend configuration is enabled and a verbs connection
    is disconnected and reconnected because of any error other than node
    shutdown or node failure, some RPC reply messages may be left in an
    internal table unintentionally. These messages remain in the table
    indefinitely, because no ack message can clean them up. Deadlock does
    not occur immediately, because these RPC messages have already been
    processed correctly. However, the problem may surface once the 32-bit
    message IDs wrap and are reused: some new messages may be recognized
    as duplicate RPCs and rejected by the destination node. These new
    messages then stay in the 'pending' state and result in deadlock.
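
The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the GPFS implementation: the `ReplyTable` class, the `next_id` and `simulate` helpers, and the duplicate check based on simple set membership are all hypothetical names and simplifications introduced here. The ID width is a parameter so that the wrap can be observed quickly; the APAR itself refers to 32-bit message IDs.

```python
class ReplyTable:
    """Hypothetical receiver-side table of RPC message IDs already processed."""

    def __init__(self):
        self.entries = set()

    def accept(self, msg_id):
        """Return False if msg_id matches a tracked entry (treated as a duplicate)."""
        if msg_id in self.entries:
            return False
        self.entries.add(msg_id)
        return True

    def ack(self, msg_id):
        """Normal cleanup path: an ack removes the entry for msg_id."""
        self.entries.discard(msg_id)


def next_id(current, id_bits):
    """Advance a message ID, wrapping at 2**id_bits (32 bits in the APAR)."""
    return (current + 1) % (1 << id_bits)


def simulate(id_bits=8, leaked_id=5):
    """Send one full cycle of IDs after a single entry has been leaked.

    Returns the list of message IDs that the receiver rejected as duplicates.
    """
    table = ReplyTable()
    # Assumption: a reconnect leaves this processed reply in the table,
    # and no ack will ever arrive to remove it.
    table.accept(leaked_id)

    rejected = []
    msg_id = leaked_id
    for _ in range(1 << id_bits):
        msg_id = next_id(msg_id, id_bits)
        if table.accept(msg_id):
            table.ack(msg_id)        # healthy messages are acked and cleaned up
        else:
            rejected.append(msg_id)  # new message mistaken for a duplicate
    return rejected


if __name__ == "__main__":
    print(simulate())
```

In this model every message is handled normally until the counter wraps back to the leaked ID, at which point a genuinely new message is rejected. With a real 32-bit counter the wrap takes about 4.29 billion messages, which is consistent with the APAR's statement that the deadlock does not occur immediately.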
    

Local fix

  • Recycle the GPFS daemon.
    

Problem summary

  • If the verbsRdmaSend configuration is enabled and a verbs connection
    is disconnected and reconnected because of any error other than node
    shutdown or node failure, some RPC reply messages may be left in an
    internal table unintentionally. These messages remain in the table
    indefinitely, because no ack message can clean them up. Deadlock does
    not occur immediately, because these RPC messages have already been
    processed correctly. However, the problem may surface once the 32-bit
    message IDs wrap and are reused: some new messages may be recognized
    as duplicate RPCs and rejected by the destination node. These new
    messages then stay in the 'pending' state and result in deadlock.
    

Problem conclusion

  • This problem is fixed in 5.1.5 PTF 1.
    To see all Spectrum Scale APARs and their respective fix
    solutions, refer to
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html

    Benefits of the solution:
    No more deadlock.

    Workaround:
    Recycle the GPFS daemon.

    Problem trigger:
    For a cluster which has the verbsRdmaSend configuration
    enabled, this problem may occur if the verbs connection
    is disconnected and reconnected due to any error other
    than node shutdown or node failure
    (for example, because of a network issue).

    Symptom:
    Hang/Deadlock/Unresponsiveness/Long Waiters

    Platforms affected:
    ALL Linux OS environments

    Functional area affected:
    RDMA

    Customer impact:
    Critical
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ43164

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    515

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2022-09-27

  • Closed date

    2022-09-27

  • Last modified date

    2022-09-27

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels

  • Business unit: IBM Infrastructure w/TPS (BU058)

  • Product code: STXKQY

  • Platform: Platform Independent (PF025)

  • Version: 515

  • Line of business: Storage (LOB26)

Document Information

Modified date:
27 September 2022