APAR status
Closed as program error.
Error description
Long waiters that wait for "RDMA read/write completion fast" occur because, in some cases, RDMA requests pending in a GPFS internal list may not be processed.
Reported in: Spectrum Scale 5.0.2.3 / RHEL 7.5 (Lenovo DSS-G)
Known Impact: Long waiters that can be resolved only by recycling the nodes that show these waiters.
Verification steps:
1) mmchconfig verbsRdmasPerConnectionOverride=4 # to set a small verbsRdmasPerConnection value
2) Run a workload that does NSD read/write.
3) Run "mmfsadm verbs |grep checkPost" during the test. Ideally the output shows "checkPost 1".
When "checkPost 1" is shown, mmdiag --waiters shows: NSPDServerIOWorkerThread: for NSPD RDMA read completion fast on node ...
Recovery action: Recycle mmfsd on the nodes showing these waiters. In a GNR environment, be aware that a recovery group takeover can happen, or vdisks can fail, if both nodes of one building block see these waiters.
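The waiter check in the verification steps can be scripted. The following is a minimal sketch, assuming the output of mmdiag --waiters has been captured as text. The helper name find_rdma_waiters is hypothetical. The "write" variant of the reason string is inferred from the phrase "RDMA read/write completion fast" and is an assumption. The second sample line is a placeholder, not real mmdiag output.

```python
import re

# Reason text quoted in this APAR. The "write" variant is an assumption
# inferred from the phrase "RDMA read/write completion fast".
RDMA_WAITER = re.compile(r"RDMA (read|write) completion fast")


def find_rdma_waiters(waiters_output):
    """Return the lines of captured waiter output that match the APAR symptom."""
    return [line for line in waiters_output.splitlines() if RDMA_WAITER.search(line)]


# The first line is the fragment quoted in this APAR; the second is a placeholder.
sample = (
    "NSPDServerIOWorkerThread: for NSPD RDMA read completion fast on node ...\n"
    "placeholder line for an unrelated waiter"
)
for line in find_rdma_waiters(sample):
    print(line)
```

A non-empty result on a node only indicates that the symptom string is present. Whether the waiter is actually long still has to be judged from the wait time reported by mmdiag.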
Local fix
Problem summary
Problem description: Long waiters that wait for "RDMA read/write completion fast" occur because, in some cases, RDMA requests pending in a GPFS internal list may not be processed.
Problem conclusion
Benefits of the solution: Avoids this long waiter.
Work Around: None
Problem trigger: On a heavily loaded NSD server or GSS/ESS server that has verbsRdma enabled, RDMA requests may be queued in a list if the number of in-flight RDMA requests on a connection exceeds verbsRdmasPerConnection. Under mutex contention, these queued requests may not be processed when the RDMA connection is closed or reconnected, which causes long waiters.
Symptom: Long Waiters
Platforms affected: ALL Linux OS environments
Functional Area affected: RDMA
Customer Impact: High Importance
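The trigger described above can be pictured with a toy model. This is an illustrative sketch only, not GPFS code: the class ToyConnection, its methods, and the parameter rdmas_per_connection (standing in for verbsRdmasPerConnection) are all hypothetical, and the model ignores the mutex condition entirely. It only shows how requests beyond the per-connection limit land in a pending list and can be left behind if the connection goes away without that list being processed.

```python
from collections import deque


class ToyConnection:
    """Illustrative model only; not GPFS code."""

    def __init__(self, rdmas_per_connection):
        self.limit = rdmas_per_connection  # stands in for verbsRdmasPerConnection
        self.in_flight = 0
        self.pending = deque()

    def submit(self, request):
        """Post the request if below the limit, otherwise queue it."""
        if self.in_flight < self.limit:
            self.in_flight += 1
            return "posted"
        self.pending.append(request)
        return "queued"

    def complete(self):
        """Finish one in-flight request and post the next queued one, if any."""
        self.in_flight -= 1
        if self.pending:
            self.pending.popleft()
            self.in_flight += 1

    def close_without_draining(self):
        """Model the failure: in-flight requests end, queued ones are left behind."""
        self.in_flight = 0
        return list(self.pending)


conn = ToyConnection(rdmas_per_connection=4)
results = [conn.submit(i) for i in range(6)]
stranded = conn.close_without_draining()
print(results)
print(stranded)
```

In this model, any thread waiting on one of the stranded requests would wait indefinitely, which corresponds to the long waiter symptom.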
Temporary fix
Comments
APAR Information
APAR number
IJ17841
Reported component name
SPEC SCALE ADV
Reported component ID
5737F35AP
Reported release
503
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2019-07-25
Closed date
2019-08-06
Last modified date
2019-08-06
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE ADV
Fixed component ID
5737F35AP
Applicable component levels
Document Information
Modified date:
06 August 2019