IBM Support

IJ33367: ENHANCE NFS HEALTH STATE CHECK PERIOD WHILE IN GRACE MODE

APAR status

  • Closed as program error.

Error description

  • The NFS health monitor triggered a failover while
    a node was being expelled from the cluster.
    The NFS service monitor had detected
    a potential hang.
    As a result, a failover was triggered even
    though the system was able to
    recover by itself after several minutes.
    

Local fix

  • The system health monitor can be configured
    to signal a degraded state
    (nfs_unresponsive event) instead of triggering a
    failover (nfs_not_active event, error state).
    

Problem summary

  • The NFS health monitor triggered a failover while
    a node was being expelled from the cluster.
    The NFS service monitor had detected
    a potential hang.
    As a result, a failover was triggered even
    though the system was able to
    recover by itself after several minutes.
    

Problem conclusion

  • This problem is fixed in 5.1.1 PTF 2.
    To see all Spectrum Scale APARs and their
    respective fix solutions, refer to
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
    Benefits of the solution:
    The NFS statistics check period (the time within which
    the number of NFS operations is expected to increase,
    signaling a healthy NFS service) can now be configured
    through a new variable (maxwaittime) in the
    /var/mmfs/mmsysmon/mmsysmonitor.conf file.
    The default value is 70 seconds. It can be increased
    as needed to avoid an unwanted failover when the
    cluster is in a state that needs more time to recover.
    
    Work Around:
    The system health monitor can be configured
    to signal a degraded state
    (nfs_unresponsive event) instead of triggering a
    failover (nfs_not_active event, error state).
    
    Problem trigger:
    The NFS service monitor detected a potential hang:
    the NFS NULL check failed and the number of internal
    NFS operations did not increase for a period of
    around 60 seconds.
    During that time NFS was in grace mode (which allows
    previous clients to reclaim their locks) and was
    therefore unable to let new clients start their I/O.
    The system health monitor did not take this grace
    period into account, although it should extend the
    waiting time.
    
    Symptom:
    Performance Impact/Degradation
    
    Platforms affected:
    ALL Linux OS environments (CES nodes)
    
    Functional Area affected:
    System Health
    
    Customer Impact:
    High Importance
    
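As an illustration of the maxwaittime setting described above, the following Python sketch reads the value from INI-style text and falls back to the 70-second default when the variable is absent. It rests on assumptions: the file is treated as INI-style, the section name "nfs" in the sample is hypothetical (this APAR does not name a section), and the helper read_max_wait_seconds is not part of any IBM tool.

```python
import configparser

DEFAULT_MAX_WAIT_SECONDS = 70  # default value stated in this APAR


def read_max_wait_seconds(conf_text: str) -> int:
    """Return maxwaittime from INI-style text, or the default if it is absent.

    Every section is searched, so no particular section name is assumed.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    for section in parser.sections():
        if parser.has_option(section, "maxwaittime"):
            return parser.getint(section, "maxwaittime")
    return DEFAULT_MAX_WAIT_SECONDS


# Hypothetical file content; the section name "nfs" is an assumption.
sample = """
[nfs]
maxwaittime = 180
"""

print(read_max_wait_seconds(sample))     # 180
print(read_max_wait_seconds("[nfs]\n"))  # 70 (variable not set)
```

The sketch parses a string rather than the real file so that it can be tried on any machine; on a CES node the contents of /var/mmfs/mmsysmon/mmsysmonitor.conf would be passed in instead.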

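The trigger condition and the effect of a longer waiting time can be modelled as a short Python sketch. It is not the monitor's actual implementation: the class and function names, the failover_on_hang flag, and the "healthy" return label are hypothetical. Only the 70-second default and the event names nfs_not_active and nfs_unresponsive come from this APAR.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class NfsCheckConfig:
    max_wait_seconds: float = 70.0  # default of maxwaittime per this APAR
    failover_on_hang: bool = True   # hypothetical flag for the degraded-state option


def ops_stalled(samples: Sequence[Tuple[float, int]], max_wait_seconds: float) -> bool:
    """True if the operation counter has been unchanged for at least max_wait_seconds.

    samples: (timestamp_seconds, total_nfs_ops) pairs in chronological order.
    """
    if not samples:
        return False
    last_time, last_ops = samples[-1]
    stall_start = last_time
    # Walk backwards to the earliest sample that still shows the same counter value.
    for timestamp, ops in reversed(samples):
        if ops != last_ops:
            break
        stall_start = timestamp
    return last_time - stall_start >= max_wait_seconds


def evaluate(null_check_ok: bool,
             samples: Sequence[Tuple[float, int]],
             config: NfsCheckConfig) -> str:
    """Return "healthy" or the event name for a suspected hang."""
    if null_check_ok or not ops_stalled(samples, config.max_wait_seconds):
        return "healthy"
    return "nfs_not_active" if config.failover_on_hang else "nfs_unresponsive"


# Counter frozen for 75 seconds, for example while NFS is in grace mode.
frozen = [(0.0, 100), (30.0, 100), (60.0, 100), (75.0, 100)]

print(evaluate(False, frozen, NfsCheckConfig()))                        # nfs_not_active
print(evaluate(False, frozen, NfsCheckConfig(max_wait_seconds=180.0)))  # healthy
print(evaluate(False, frozen, NfsCheckConfig(failover_on_hang=False)))  # nfs_unresponsive
```

With the default of 70 seconds the frozen counter is reported as a hang, whereas a larger waiting time gives the cluster room to leave grace mode before any failover is considered.
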
Temporary fix

Comments

APAR Information

  • APAR number

    IJ33367

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    511

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2021-06-22

  • Closed date

    2021-06-22

  • Last modified date

    2021-06-22

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels


Document Information

Modified date:
23 June 2021