APAR status
Closed as program error.
Error description
In rare cases, all quorum nodes can be expelled after the current cluster manager is expelled because of a network-level error on the cluster manager node, which results in cluster-wide quorum loss. This applies only when RDMA is enabled and all GPFS RPCs go over RDMA (verbsPorts, verbsRdma, and verbsRdmaSend must all be set). The current cluster manager is expelled because of the network error, as expected. The newly elected cluster manager then cannot make progress in its subsequent group protocol, because it waits for a 10-second linger timeout in the CCR while a cached socket connection to the former cluster manager is being closed. As a result, all quorum nodes are expelled.
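The linger timeout mentioned above can be illustrated with the standard SO_LINGER socket option. This is only a sketch: the APAR does not say how the CCR configures its sockets, so SO_LINGER itself, the helper names set_linger and get_linger, and the Linux layout of struct linger (two ints) are all assumptions. With lingering enabled, close() on a blocking socket that still has unsent data can block until the data is delivered or the timeout expires, which matches the stall described when the peer is unreachable.

```python
import socket
import struct

LINGER_SECONDS = 10  # the timeout value quoted in the APAR text

def set_linger(sock, enabled, seconds):
    """Hypothetical helper: enable or disable SO_LINGER on a socket.

    Assumes the Linux layout of struct linger: { int l_onoff; int l_linger; }.
    """
    value = struct.pack("ii", 1 if enabled else 0, seconds)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, value)

def get_linger(sock):
    """Hypothetical helper: read back (l_onoff, l_linger) from the socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize("ii"))
    return struct.unpack("ii", raw)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
set_linger(sock, True, LINGER_SECONDS)
onoff, seconds = get_linger(sock)
print("linger enabled:", onoff != 0, "timeout (s):", seconds)

# This socket was never connected, so close() returns immediately here.
# On a connected blocking socket with unsent data and an unreachable peer,
# close() could block for up to LINGER_SECONDS.
sock.close()
```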
Local fix
Problem summary
In rare cases, all quorum nodes can be expelled after the current cluster manager is expelled because of a network-level error on the cluster manager node, which results in cluster-wide quorum loss. This applies only when RDMA is enabled and all GPFS RPCs go over RDMA (verbsPorts, verbsRdma, and verbsRdmaSend must all be set). The current cluster manager is expelled because of the network error, as expected. The newly elected cluster manager then cannot make progress in its subsequent group protocol, because it waits for a 10-second linger timeout in the CCR while a cached socket connection to the former cluster manager is being closed. As a result, all quorum nodes are expelled.
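The condition on the three verbs settings can be expressed as a small check. This is a sketch under stated assumptions: the function name rdma_for_all_rpcs and the dictionary input are hypothetical, and the values "enable" for verbsRdma and "yes" for verbsRdmaSend are assumed spellings that should be compared against the actual configuration output of the cluster. The device name in the example is also illustrative.

```python
def rdma_for_all_rpcs(config):
    """Hypothetical check: are all three settings named in this APAR active?

    `config` maps attribute names to their string values. The expected
    values ("enable", "yes") are assumptions, not taken from the APAR.
    """
    ports = config.get("verbsPorts", "").strip()
    rdma = config.get("verbsRdma", "disable").strip().lower()
    send = config.get("verbsRdmaSend", "no").strip().lower()
    return bool(ports) and rdma == "enable" and send == "yes"

# Illustrative values only; the device/port name is made up.
example = {
    "verbsPorts": "mlx5_0/1",
    "verbsRdma": "enable",
    "verbsRdmaSend": "yes",
}
print(rdma_for_all_rpcs(example))
```

If the check returns False for a cluster, the scenario described in this APAR should not apply to it, since not all GPFS RPCs would be going over RDMA.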
Problem conclusion
Benefits of the solution: The remaining quorum nodes stay active and start a new cluster manager election when the current cluster manager is expelled because of a network error while RDMA is used for all GPFS RPCs. The newly elected cluster manager can finish its group protocol in time.
Work around: Not available.
Problem trigger: Error injection at the network level (daemon IP address) on the current cluster manager node when RDMA is used for all GPFS RPCs.
Symptom: Node expel / Lost Membership / Quorum loss.
Platforms affected: x86_64-linux at least
Functional Area affected: RDMA, Cluster Membership, Cluster Manager
Customer Impact: High Importance
Temporary fix
Comments
APAR Information
APAR number
IJ25511
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
505
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2020-06-15
Closed date
2020-06-15
Last modified date
2020-06-15
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP
Applicable component levels
Product: IBM Spectrum Scale; Version: 505; Platform: Platform Independent; Line of Business: Storage; Business Unit: IBM Infrastructure w/TPS
Document Information
Modified date:
12 August 2020