APAR status
Closed as program error.
Error description
Error Description:
In a cluster with a firewall configured, mmhealth may report ccr_quorum_nodes_warn and/or ccr_ccr_local_server_warn for some quorum nodes.

Reported in: Spectrum Scale 5.1.1.1 on RHEL7
Known impact: deadlock/daemon crash

Verification steps:

(1) Run "mmccr check -e" on the problematic node; the ping of, or connection to, the local CCR server may fail. mmccr check results:

CCR Client initialization succeed
Check CCR authorized key file succeed
Check CCR cached directory and file succeed
Check both CCR paxos files succeed
Ping local CCR server failed (895-Ping local CCR server failed)
Check CCR server IP address lookup succeed
Ping CCR quorum nodes failed (895-Ping CCR quorum nodes failed)
Check files in CCR committed directory failed (895-Connect local CCR server failed)
Check CCR tiebreaker disks succeed (No tiebreaker disks configured)

(2) Check the GPFS log /var/adm/ras/mmfs.log.latest; you may see entries like the following:

2022-02-27_11:21:03.714+1000: [N] Purged 9001 CCR request(s) due to err: 895 (Maximal number of requests in operation queue reached (Check for request jam)) since 2022-02-27_08:59:03.694+1000
2022-02-27_14:28:15.815+1000: [N] Purged 9004 CCR request(s) due to err: 895 (Maximal number of requests in operation queue reached (Check for request jam)) since 2022-02-27_12:05:43.814+1000
2022-02-27_17:36:15.986+1000: [N] Purged 9002 CCR request(s) due to err: 895 (Maximal number of requests in operation queue reached (Check for request jam)) since 2022-02-27_15:14:35.964+1000
2022-02-27_20:46:16.274+1000: [N] Purged 9004 CCR request(s) due to err: 895 (Maximal number of requests in operation queue reached (Check for request jam)) since 2022-02-27_18:23:16.224+1000

(3) Run "netstat -an | grep :1191" on the problematic node; you may see many sockets in the CLOSE_WAIT or LAST_ACK state.

# netstat -an | grep :1191
...........
tcp 1 0 10.10.10.6:1191 10.10.10.12:40152 CLOSE_WAIT
tcp 1 0 10.10.10.6:1191 10.10.10.26:39114 CLOSE_WAIT
tcp 1 0 10.10.10.6:1191 10.10.10.21:49593 CLOSE_WAIT
tcp 1 0 10.10.10.6:1191 10.10.10.24:55971 CLOSE_WAIT
tcp 1 0 10.10.10.6:1191 10.10.10.26:53336 LAST_ACK
...........

Recovery action:
If enough quorum nodes are up and active:

# Disable the quorum role of the problematic node
mmchnode --noquorum -N <problematic node>

# Monitor "netstat -an | grep :1191" and wait until those TCP sockets are gone,
# then re-enable the quorum role
mmchnode --quorum -N <problematic node>
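The wait in the recovery action above can be scripted. The following is a hedged sketch only, not part of the official procedure: the function name is made up for this example, and the sample netstat lines are illustrative rather than live data. It counts port-1191 (GPFS/CCR) sockets stuck in CLOSE_WAIT or LAST_ACK, reading "netstat -an"-style output from stdin:

```shell
#!/bin/sh
# Sketch only: count CCR sockets on port 1191 that are stuck in
# CLOSE_WAIT or LAST_ACK state. Input is `netstat -an` output on stdin.
# The function name is hypothetical, invented for this example.
count_stale_ccr_sockets() {
    grep ':1191 ' | grep -cE 'CLOSE_WAIT|LAST_ACK'
}

# Illustrative sample in the format shown above (not live data):
sample='tcp        1      0 10.10.10.6:1191   10.10.10.12:40152   CLOSE_WAIT
tcp        0      0 10.10.10.6:1191   10.10.10.13:40153   ESTABLISHED
tcp        1      0 10.10.10.6:1191   10.10.10.26:53336   LAST_ACK'

printf '%s\n' "$sample" | count_stale_ccr_sockets   # prints 2
```

On a real node one would run "netstat -an | count_stale_ccr_sockets" periodically and wait for the count to reach 0 before re-enabling the quorum role with mmchnode.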
Local fix
N/A
Problem summary
CCR becomes slow on a quorum node when the configured firewall drops the FIN TCP/IP packets of CCR requests.
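As an illustrative, hedged example (not a documented fix for this APAR): the firewall on each cluster node should permit all TCP traffic on the GPFS/CCR port 1191 between cluster nodes, including connection-teardown (FIN) packets. With an iptables-based firewall, assumed here for illustration, such rules might look like:

```shell
# Sketch only, assuming an iptables-based firewall; adapt to your policy.
# Accept GPFS/CCR traffic (TCP port 1191) in both directions so that
# the FIN packets of CCR requests are not dropped.
iptables -A INPUT  -p tcp --dport 1191 -j ACCEPT
iptables -A OUTPUT -p tcp --sport 1191 -j ACCEPT
```

Restricting these rules to the cluster nodes' source addresses would be tighter; the exact policy depends on the site's firewall configuration.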
Problem conclusion
This problem is fixed in 5.1.2 PTF 6.

To see all Spectrum Scale APARs and their respective fixes, refer to:
https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html

Benefits of the solution: This change avoids a slow CCR main thread on quorum nodes in case the firewall drops FIN packets of CCR requests.
Work around: Not available.
Problem trigger: Misconfigured firewall.
Symptom: Performance impact/degradation.
Platforms affected: x86_64-linux only so far.
Functional Area affected: CCR
Customer Impact: High Importance.
Temporary fix
Comments
APAR Information
APAR number
IJ39267
Reported component name
SPEC SCALE ADV
Reported component ID
5737F35AP
Reported release
511
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2022-04-06
Closed date
2022-06-29
Last modified date
2022-06-29
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE ADV
Fixed component ID
5737F35AP
Applicable component levels
[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"511","Line of Business":{"code":"LOB26","label":"Storage"}}]
Document Information
Modified date:
30 June 2022