IBM Support

IJ17841: LONG WAITERS WHICH WAIT FOR "RDMA READ/WRITE COMPLETION FAST"

APAR status

  • Closed as program error.

Error description

  • Long waiters appear that wait for "RDMA read/write completion
    fast" because, in some cases, RDMA requests pending in a GPFS
    internal list may not be processed.
    
    Reported in:
    Spectrum Scale 5.0.2.3 / RHEL 7.5 ( Lenovo DSS-G )
    
    Known Impact:
    Long waiters that can be resolved only by recycling
    the nodes that show these waiters.
    
    Verification steps:
    
    1) mmchconfig verbsRdmasPerConnectionOverride=4  # sets
    a small verbsRdmasPerConnection value
    2) Run a workload that does NSD reads/writes.
    3) Run "mmfsadm verbs | grep checkPost" during the test;
    ideally the output shows "checkPost 1".
    
    mmdiag --waiters shows
    
    NSPDServerIOWorkerThread: for NSPD RDMA read completion
    fast on node ...
    
    when "checkPost 1" is reported.
    
    Recovery action:
    Recycle mmfsd on the nodes that show these waiters. In a GNR
    environment, be aware that a recovery group takeover can
    happen, or that vdisks can fail if both nodes of one
    building block show these waiters.
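
    The checks in the verification steps can be wrapped in two small
    filters, shown here as a minimal sketch. The function names
    count_rdma_waiters and has_checkpost_1 are hypothetical helpers, not
    part of Spectrum Scale; they only match the strings quoted in this
    APAR and assume that "checkPost 1" appears with a single space in the
    "mmfsadm verbs" output.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not Spectrum Scale commands.

# Read "mmdiag --waiters" output on stdin and print how many lines
# mention the waiter reason quoted in this APAR (read or write variant).
count_rdma_waiters() {
    # grep -c exits non-zero when the count is 0, so mask the status.
    grep -c -E 'RDMA (read|write) completion fast' || true
}

# Read "mmfsadm verbs" output on stdin; succeed if "checkPost 1" is present.
# Assumption: the value follows "checkPost" after exactly one space.
has_checkpost_1() {
    grep -q -E 'checkPost 1([^0-9]|$)'
}

# Example use on an affected node (commands taken from the steps above):
#   mmdiag --waiters | count_rdma_waiters
#   mmfsadm verbs | has_checkpost_1 && echo "checkPost 1 seen"
```

    A non-zero count together with "checkPost 1" would match the
    condition described above; the filters only read command output and
    do not change any configuration.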
    

Local fix

Problem summary

  • Problem description:
    Long waiters appear that wait for "RDMA read/write completion fast"
    because, in some cases, RDMA requests pending in a GPFS internal
    list may not be processed.
    

Problem conclusion

  • Benefits of the solution:
    Avoids this long waiter.
    
    Work Around:
    None
    
    Problem trigger:
    On a heavily loaded NSD server or GSS/ESS server that has verbsRdma
    enabled, RDMA requests may be queued in a list if the number of
    in-flight RDMA requests on a connection exceeds
    verbsRdmasPerConnection. Under a mutex conflict, the queued requests
    may not be processed when the RDMA connection is closed or
    reconnected, which causes long waiters.
    
    Symptom:
    Long Waiters
    
    Platforms affected:
    ALL Linux OS environments
    
    Functional Area affected:
    RDMA
    
    Customer Impact:
    High Importance
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ17841

  • Reported component name

    SPEC SCALE ADV

  • Reported component ID

    5737F35AP

  • Reported release

    503

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2019-07-25

  • Closed date

    2019-08-06

  • Last modified date

    2019-08-06

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    IJ18174 IJ18447

Fix information

  • Fixed component name

    SPEC SCALE ADV

  • Fixed component ID

    5737F35AP

Applicable component levels

  • Product: IBM Spectrum Scale (STXKQY)
  • Version: 503
  • Platform: Platform Independent (PF025)
  • Business Unit: IBM Infrastructure w/TPS (BU058)
  • Line of Business: Storage (LOB26)

Document Information

Modified date:
06 August 2019