APAR status
Closed as program error.
Error description
Error Description:
In a cluster with a firewall configured, mmhealth may report ccr_quorum_nodes_warn and/or ccr_ccr_local_server_warn for some quorum nodes.

Reported in: Spectrum Scale 5.1.1.1 on RHEL7
Known impact: deadlock/daemon crash

Verification steps:

(1) Run "mmccr check -e" on the problematic node; the ping of, or connection to, the local CCR server may fail. mmccr check results:

CCR Client initialization succeed
Check CCR authorized key file succeed
Check CCR cached directory and file succeed
Check both CCR paxos files succeed
Ping local CCR server failed (895-Ping local CCR server failed)
Check CCR server IP address lookup succeed
Ping CCR quorum nodes failed (895-Ping CCR quorum nodes failed)
Check files in CCR committed directory failed (895-Connect local CCR server failed)
Check CCR tiebreaker disks succeed (No tiebreaker disks configured)

(2) Check the GPFS log /var/adm/ras/mmfs.log.latest; you may see entries like the following:

2022-02-27_11:21:03.714+1000: [N] Purged 9001 CCR request(s) due to err: 895 (Maximal number of requests in operation queue reached (Check for request jam)) since 2022-02-27_08:59:03.694+1000
2022-02-27_14:28:15.815+1000: [N] Purged 9004 CCR request(s) due to err: 895 (Maximal number of requests in operation queue reached (Check for request jam)) since 2022-02-27_12:05:43.814+1000
2022-02-27_17:36:15.986+1000: [N] Purged 9002 CCR request(s) due to err: 895 (Maximal number of requests in operation queue reached (Check for request jam)) since 2022-02-27_15:14:35.964+1000
2022-02-27_20:46:16.274+1000: [N] Purged 9004 CCR request(s) due to err: 895 (Maximal number of requests in operation queue reached (Check for request jam)) since 2022-02-27_18:23:16.224+1000

(3) Run "netstat -an | grep :1191" on the problematic node; you may see many sockets in the CLOSE_WAIT or LAST_ACK state.

# netstat -an | grep :1191
...........
tcp 1 0 10.10.10.6:1191 10.10.10.12:40152 CLOSE_WAIT
tcp 1 0 10.10.10.6:1191 10.10.10.26:39114 CLOSE_WAIT
tcp 1 0 10.10.10.6:1191 10.10.10.21:49593 CLOSE_WAIT
tcp 1 0 10.10.10.6:1191 10.10.10.24:55971 CLOSE_WAIT
tcp 1 0 10.10.10.6:1191 10.10.10.26:53336 LAST_ACK
...........

Recovery action:
If enough quorum nodes are up and active:

# Disable the quorum role of the problematic node
mmchnode --noquorum -N <problematic node>

# Monitor "netstat -an | grep :1191" and wait until those TCP sockets are gone,
# then re-enable the quorum role
mmchnode --quorum -N <problematic node>
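The wait in the recovery action above can be scripted. The following is a hedged sketch only, not part of the official procedure: the function name is made up for this example, and the sample netstat lines are illustrative rather than live data. It counts port-1191 (GPFS/CCR) sockets stuck in CLOSE_WAIT or LAST_ACK, reading "netstat -an"-style output from stdin:

```shell
#!/bin/sh
# Sketch only: count CCR sockets on port 1191 that are stuck in
# CLOSE_WAIT or LAST_ACK state. Input is `netstat -an` output on stdin.
# The function name is hypothetical, invented for this example.
count_stale_ccr_sockets() {
    grep ':1191 ' | grep -cE 'CLOSE_WAIT|LAST_ACK'
}

# Illustrative sample in the format shown above (not live data):
sample='tcp        1      0 10.10.10.6:1191   10.10.10.12:40152   CLOSE_WAIT
tcp        0      0 10.10.10.6:1191   10.10.10.13:40153   ESTABLISHED
tcp        1      0 10.10.10.6:1191   10.10.10.26:53336   LAST_ACK'

printf '%s\n' "$sample" | count_stale_ccr_sockets   # prints 2
```

On a real node one would run "netstat -an | count_stale_ccr_sockets" periodically and wait for the count to reach 0 before re-enabling the quorum role with mmchnode.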
Local fix
N/A
Problem summary
CCR becomes slow on a quorum node when the configured firewall drops the FIN TCP/IP packets of CCR requests.
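As an illustrative, hedged example (not a documented fix for this APAR): the firewall on each cluster node should permit all TCP traffic on the GPFS/CCR port 1191 between cluster nodes, including connection-teardown (FIN) packets. With an iptables-based firewall, assumed here for illustration, such rules might look like:

```shell
# Sketch only, assuming an iptables-based firewall; adapt to your policy.
# Accept GPFS/CCR traffic (TCP port 1191) in both directions so that
# the FIN packets of CCR requests are not dropped.
iptables -A INPUT  -p tcp --dport 1191 -j ACCEPT
iptables -A OUTPUT -p tcp --sport 1191 -j ACCEPT
```

Restricting these rules to the cluster nodes' source addresses would be tighter; the exact policy depends on the site's firewall configuration.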
Problem conclusion
This problem is fixed in 5.1.2 PTF 6.

To see all Spectrum Scale APARs and their respective fixes, refer to:
https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html

Benefits of the solution: This change avoids a slow CCR main thread on quorum nodes in case the firewall drops FIN packets of CCR requests.
Work around: Not available.
Problem trigger: Misconfigured firewall.
Symptom: Performance impact/degradation.
Platforms affected: x86_64-linux only so far.
Functional Area affected: CCR
Customer Impact: High Importance.
Temporary fix
Comments
APAR Information
APAR number
IJ39267
Reported component name
SPEC SCALE ADV
Reported component ID
5737F35AP
Reported release
511
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2022-04-06
Closed date
2022-06-29
Last modified date
2022-06-29
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE ADV
Fixed component ID
5737F35AP
Applicable component levels
[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"511","Line of Business":{"code":"LOB26","label":"Storage"}}]
Document Information
Modified date:
30 June 2022