IBM Support

IJ39267: MMHEALTH REPORTS CCR_QUORUM_NODES_WARN/CCR_CCR_LOCAL_SERVER_WARN

APAR status

  • Closed as program error.

Error description

  • Error Description:
    In a cluster with a firewall configured, mmhealth may
    report ccr_quorum_nodes_warn and/or
    ccr_ccr_local_server_warn for some quorum nodes.
    
    Reported in:
    Spectrum Scale 5.1.1.1 on RHEL7
    
    Known Impact:
    deadlock/daemon crash
    
    Verification steps:
    
    (1) Run "mmccr check -e" on the problematic node. The ping
    of, or the connection to, the local CCR server may fail.
    
    mmccr check results:
    CCR Client initialization succeed
    Check CCR authorized key file succeed
    Check CCR cached directory and file succeed
    Check both CCR paxos files succeed
    Ping local CCR server failed (895-Ping local CCR server failed)
    Check CCR server IP address lookup succeed
    Ping CCR quorum nodes failed (895-Ping CCR quorum nodes failed)
    Check files in CCR committed directory failed (895-Connect local CCR server failed)
    Check CCR tiebreaker disks succeed (No tiebreaker disks configured)
    
    (2) Check the GPFS log /var/adm/ras/mmfs.log.latest. You
    may see entries like the following:
    
    2022-02-27_11:21:03.714+1000: [N] Purged 9001 CCR request(s) due to err: 895 (Maximal number of requests in operation queue reached (Check for request jam)) since 2022-02-27_08:59:03.694+1000
    2022-02-27_14:28:15.815+1000: [N] Purged 9004 CCR request(s) due to err: 895 (Maximal number of requests in operation queue reached (Check for request jam)) since 2022-02-27_12:05:43.814+1000
    2022-02-27_17:36:15.986+1000: [N] Purged 9002 CCR request(s) due to err: 895 (Maximal number of requests in operation queue reached (Check for request jam)) since 2022-02-27_15:14:35.964+1000
    2022-02-27_20:46:16.274+1000: [N] Purged 9004 CCR request(s) due to err: 895 (Maximal number of requests in operation queue reached (Check for request jam)) since 2022-02-27_18:23:16.224+1000
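
A minimal sketch for pulling these purge notices out of the log. The function name ccr_purge_summary is hypothetical and not part of Spectrum Scale. It assumes each notice starts with the timestamp, "[N]", and "Purged <count> CCR", as in the entries above.

```shell
# Hypothetical helper, not a Spectrum Scale command.
# Reads GPFS log text on stdin and prints "<timestamp> <purged count>"
# for every CCR purge notice; all other lines are ignored.
ccr_purge_summary() {
    sed -n 's/^ *\([0-9_:.+-]*\): \[N\] Purged \([0-9][0-9]*\) CCR.*/\1 \2/p'
}
```

Possible usage: ccr_purge_summary < /var/adm/ras/mmfs.log.latest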
    
    (3) Run "netstat -an | grep :1191" on the problematic
    node. You may see many sockets in the CLOSE_WAIT or
    LAST_ACK state.
    # netstat -an | grep :1191
    ...........
    tcp        1      0 10.10.10.6:1191      10.10.10.12:40152      CLOSE_WAIT
    tcp        1      0 10.10.10.6:1191      10.10.10.26:39114      CLOSE_WAIT
    tcp        1      0 10.10.10.6:1191      10.10.10.21:49593      CLOSE_WAIT
    tcp        1      0 10.10.10.6:1191      10.10.10.24:55971      CLOSE_WAIT
    tcp        1      0 10.10.10.6:1191      10.10.10.26:53336      LAST_ACK
    ...........
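
When the listing is long, a per-state count is easier to read than the raw output. The following is a sketch only. The function name count_1191_states is hypothetical, and it assumes the Linux "netstat -an" column order (protocol, Recv-Q, Send-Q, local address, foreign address, state).

```shell
# Hypothetical helper: reads "netstat -an" output on stdin and prints
# one "<state> <count>" line per TCP state for sockets whose local
# address ends in :1191. Assumes the Linux netstat column layout.
count_1191_states() {
    awk '$1 ~ /^tcp/ && $4 ~ /:1191$/ { n[$6]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

Possible usage: netstat -an | count_1191_states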
    
    Recovery action:
    If enough quorum nodes are up and active:
    # Disable the quorum role of the problematic node:
       mmchnode --noquorum -N <problematic node>
    # Monitor "netstat -an | grep :1191" and wait until the
    # CLOSE_WAIT and LAST_ACK sockets are gone, then re-enable
    # the quorum role:
       mmchnode --quorum -N <problematic node>
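
The waiting step between the two mmchnode commands could be scripted roughly as follows. This is a sketch under assumptions, not an IBM-provided procedure. The function names stale_1191_count and wait_for_1191_drain are hypothetical, the 30-second default interval is arbitrary, and the Linux "netstat -an" column layout is assumed.

```shell
# Hypothetical helper: reads "netstat -an" output on stdin and prints
# the number of sockets on local port 1191 that are in CLOSE_WAIT or
# LAST_ACK (prints 0 when there are none).
stale_1191_count() {
    awk '$4 ~ /:1191$/ && ($6 == "CLOSE_WAIT" || $6 == "LAST_ACK") { c++ }
         END { print c + 0 }'
}

# Hypothetical helper: polls netstat until no such sockets remain.
# Optional first argument: polling interval in seconds (default 30,
# an arbitrary choice).
wait_for_1191_drain() {
    interval=${1:-30}
    while [ "$(netstat -an | stale_1191_count)" -gt 0 ]; do
        sleep "$interval"
    done
}
```

Possible usage: run wait_for_1191_drain on the problematic node after "mmchnode --noquorum" has completed, and run "mmchnode --quorum" only once it returns.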
    

Local fix

  • N/A
    

Problem summary

  • CCR becomes slow on a quorum node when the
    configured firewall drops the TCP FIN
    packets of CCR requests.
    

Problem conclusion

  • This problem is fixed in 5.1.2 PTF 6.
    To see all Spectrum Scale APARs and
    their respective fixes, refer to:
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
    
    Benefits of the solution:
    This change avoids a slow CCR main
    thread on quorum nodes when the firewall
    drops the FIN packets of CCR requests.
    
    Work around:
    Not available.
    Problem trigger:
    Misconfigured firewall.
    Symptom:
    Performance impact/degradation.
    Platforms affected:
    x86_64-linux only so far.
    Functional Area affected:
    CCR
    Customer Impact:
    High Importance.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ39267

  • Reported component name

    SPEC SCALE ADV

  • Reported component ID

    5737F35AP

  • Reported release

    511

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2022-04-06

  • Closed date

    2022-06-29

  • Last modified date

    2022-06-29

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE ADV

  • Fixed component ID

    5737F35AP

Applicable component levels

Document Information

Modified date:
30 June 2022