APAR status
Closed as program error.
Error description
If the verbsRdmaSend configuration option is enabled and a verbs connection is disconnected and then reconnected because of any error other than node shutdown or node failure, some RPC reply messages may be left in the internal table unintentionally. These messages remain in the table indefinitely, because no ack message can clean them up. Deadlock does not occur immediately, because these RPC messages have already been processed correctly. However, the problem can surface once the 32-bit message IDs wrap around and are reused: some new messages may be recognized as duplicate RPCs and rejected by the destination node. These new messages stay in the 'pending' state and result in deadlock.
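The failure sequence described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the GPFS implementation: the `ReplyTable` class, its `receive` and `ack` methods, and the `simulate` function are hypothetical names, the table is modeled as a plain set of message IDs, and an 8-bit ID space is used so that the wrap-around can be reached quickly (the APAR concerns 32-bit IDs).

```python
class ReplyTable:
    """Hypothetical receiver-side table of processed RPC IDs awaiting an ack."""

    def __init__(self):
        self.entries = set()

    def receive(self, msg_id):
        # An ID that is still in the table looks like a retransmitted RPC.
        if msg_id in self.entries:
            return "rejected-duplicate"
        self.entries.add(msg_id)
        return "processed"

    def ack(self, msg_id):
        # Normal cleanup path: the ack removes the entry.
        self.entries.discard(msg_id)


def simulate(id_bits=8, leaked_index=5):
    """Send messages with wrapping IDs; the ack for one message is lost.

    Returns (messages_sent_before_rejection, rejected_id), or None if no
    message was rejected. All parameters are illustrative assumptions.
    """
    id_space = 1 << id_bits
    table = ReplyTable()
    msg_id = 0
    for n in range(id_space + leaked_index + 1):
        if table.receive(msg_id) == "rejected-duplicate":
            return n, msg_id
        # Assumed model of the reconnect: one entry never gets its ack.
        if n != leaked_index:
            table.ack(msg_id)
        msg_id = (msg_id + 1) % id_space  # IDs wrap and are reused
    return None


print(simulate())
```

In this model every message is handled normally for a full pass through the ID space, which mirrors the statement that deadlock does not occur immediately. The first rejection happens only when the counter returns to the ID whose entry was leaked.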
Local fix
Recycle the GPFS daemon.
Problem summary
If the verbsRdmaSend configuration option is enabled and a verbs connection is disconnected and then reconnected because of any error other than node shutdown or node failure, some RPC reply messages may be left in the internal table unintentionally. These messages remain in the table indefinitely, because no ack message can clean them up. Deadlock does not occur immediately, because these RPC messages have already been processed correctly. However, the problem can surface once the 32-bit message IDs wrap around and are reused: some new messages may be recognized as duplicate RPCs and rejected by the destination node. These new messages stay in the 'pending' state and result in deadlock.
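How long the problem stays latent depends on how quickly the 32-bit ID space is consumed. The arithmetic can be sketched as follows; the function name `seconds_until_wrap` and the message rate of 10,000 messages per second are purely illustrative assumptions, not figures from this APAR, and the sketch assumes a single counter incremented once per message.

```python
def seconds_until_wrap(messages_per_second, id_bits=32):
    """Time for a message-ID counter of id_bits bits to wrap once.

    Assumes one ID is consumed per message at a constant rate.
    """
    if messages_per_second <= 0:
        raise ValueError("messages_per_second must be positive")
    return (1 << id_bits) / messages_per_second


# Hypothetical rate, chosen only to show the order of magnitude.
rate = 10_000
seconds = seconds_until_wrap(rate)
print(f"{seconds:.0f} s, about {seconds / 86_400:.1f} days")
```

Under these assumptions the counter wraps after roughly five days, so a stale entry could sit unnoticed for days after the reconnect that created it.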
Problem conclusion
This problem is fixed in 5.1.5 PTF 1.
To see all Spectrum Scale APARs and their respective fix solutions, refer to https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
Benefits of the solution: No more deadlock
Work around: Recycle the GPFS daemon.
Problem trigger: For a cluster that has the verbsRdmaSend configuration option enabled, this problem may occur if a verbs connection is disconnected and reconnected because of any error other than node shutdown or node failure (for example, a network issue).
Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
Platforms affected: ALL Linux OS environments
Functional Area affected: RDMA
Customer Impact: Critical
Temporary fix
Comments
APAR Information
APAR number
IJ43164
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
515
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2022-09-27
Closed date
2022-09-27
Last modified date
2022-09-27
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP
Applicable component levels
Document Information
Modified date:
27 September 2022