IJ03154: ESS: IO HUNG ON REMOTE CLIENT WHEN BOTH ESS IO NODES ARE DOWN


APAR status

  • Closed as program error.

Error description

  • When both ESS IO nodes of the same building block are
    shut down, an NSD client in a remote cluster may find file
    system operations hanging for a long time, even if the file
    system has more than one replica defined and at least one
    replica is still available.
    
    Reported in:
    Spectrum Scale 4.2.2.3 on RHEL7
    
    Known Impact:
    File system access will hang for a long time
    (by default up to 82 minutes), or until an ESS IO node is
    started again.
    
    Verification steps:
    1. Shut down both ESS IO nodes of the same building block.
    2. Access the file system from a client node in a remote
       cluster.
    3. The access will hang.
    4. Check the GPFS waiters on the client node; you will see
       threads waiting for NSD I/O completion on node
       <none> <none>, similar to the following (a scripted
       version of this check is sketched at the end of this
       description):
       Waiting 2.1923 sec since 19:23:24, monitored, thread
       9000 xxxxxxThread: for NSD I/O completion on node <none>
       <none>
    
    Recovery action:
    Bring up ESS IO nodes.
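
    As a convenience, the waiter check in step 4 can be scripted.
    Below is a minimal Python sketch, assuming only that the GPFS
    administration command mmdiag (typically in /usr/lpp/mmfs/bin)
    is on the PATH of the client node; it filters the output of
    'mmdiag --waiters' for the pattern shown above:

      #!/usr/bin/env python3
      # Minimal sketch: list GPFS waiters stuck on NSD I/O
      # completion. Run on the affected client node.
      import subprocess
      import sys

      def nsd_io_waiters():
          # 'mmdiag --waiters' lists threads currently waiting
          # inside the GPFS daemon.
          out = subprocess.run(["mmdiag", "--waiters"],
                               capture_output=True, text=True,
                               check=True).stdout
          # Keep only waiters of the kind quoted above.
          return [line for line in out.splitlines()
                  if "for NSD I/O completion on node" in line]

      if __name__ == "__main__":
          hung = nsd_io_waiters()
          print("\n".join(hung))
          # Exit non-zero when such waiters exist, so the script
          # can double as a simple health probe.
          sys.exit(1 if hung else 0)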
    

Local fix

Problem summary

  • There are two sets of ESS building blocks and GPFS replication
    is enabled. The expectation is that an outage of only one ESS
    building block will not block or crash the whole file system.
    However, when both ESS servers of one building block are shut
    down simultaneously, the system hangs: the remote NSD clients
    stay in a long retry loop waiting for the already-down ESS
    server pair to become ready again, which is unreasonable.
    Analysis shows that the NSD clients in the home cluster can
    break out of this loop without entering the RGCM query,
    because they hold a list of the statically defined ESS
    servers. This list is not passed to the remote NSD clients,
    however, so they fall into the RGCM query and remain waiting
    in the loop.
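
    The loop described above can be modeled with a small Python
    sketch. The names (pick_nsd_server, is_up, query_rgcm, the
    retry parameters) are hypothetical illustrations of this text,
    not the actual Spectrum Scale implementation:

      import time

      def pick_nsd_server(static_servers, is_up, query_rgcm,
                          retry_secs=5, max_retries=3):
          # Home-cluster path: with the statically defined ESS
          # server list a client can see that every server is down
          # and give up at once instead of looping.
          if static_servers:
              up = [s for s in static_servers if is_up(s)]
              return up[0] if up else None
          # Remote-client path before the fix: the list arrives
          # empty, so the client keeps querying RGCM for a ready
          # server and its file system I/O hangs for as long as
          # both ESS IO nodes stay down.
          for _ in range(max_retries):
              server = query_rgcm()
              if server is not None:
                  return server
              time.sleep(retry_secs)
          return None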
    

Problem conclusion

  • This issue was introduced by commit d0f26a09. The fix adjusts
    the behavior slightly: for statically defined ESS servers, the
    server list is now passed to the remote clients as well; for a
    dynamically selected Mestor vdisk server, the current behavior
    of passing a null server list is kept. This allows the remote
    clients and the home cluster clients to behave the same way.
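
    A rough model of the fixed decision, again with hypothetical
    names rather than the actual Spectrum Scale code:

      def server_list_for_remote_client(servers, statically_defined):
          # Statically defined ESS servers: propagate the list so
          # remote clients can break out of the retry loop exactly
          # like home-cluster clients do.
          # Dynamically selected Mestor vdisk server: keep the
          # current behavior and send a null (None) list.
          return list(servers) if statically_defined else None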
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ03154

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    500

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2018-01-10

  • Closed date

    2018-01-10

  • Last modified date

    2018-04-05

  • APAR is sysrouted FROM one or more of the following:

    IJ03098

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels

  • R500 PSY U880930

       18/04/05 I 1000

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY","label":"IBM Spectrum Scale"},"Component":"","ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"500","Edition":"","Line of Business":{"code":"LOB26","label":"Storage"}}]
