IJ03154: ESS: IO HUNG ON REMOTE CLIENT WHEN BOTH ESS IO NODES ARE DOWN


APAR status

  • Closed as program error.

Error description

  • When both ESS IO nodes of the same building block are
    shut down, an NSD client in a remote cluster may find file
    system operations hanging for a long time, even if the file
    system has more than one replica defined and at least one
    replica is still available.
    
    Reported in:
    Spectrum Scale 4.2.2.3 on RHEL7
    
    Known Impact:
    File system access will hang for a long time
    (by default up to 82 minutes), or until an ESS IO node is
    started again.
    
    Verification steps:
    1. Shut down both ESS IO nodes of the same building block.
    2. Access the file system from a client node in a remote
       cluster.
    3. The access will hang.
    4. Check the GPFS waiters on the client node; you will see
       threads waiting for NSD I/O completion on node
       <none> <none>, similar to the following (a scripted
       version of this check is sketched at the end of this
       description):
       Waiting 2.1923 sec since 19:23:24, monitored, thread
       9000 xxxxxxThread: for NSD I/O completion on node <none>
       <none>
    
    Recovery action:
    Bring up ESS IO nodes.
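
    As a convenience, the waiter check in step 4 can be scripted.
    Below is a minimal Python sketch, assuming only that the GPFS
    administration command mmdiag (typically in /usr/lpp/mmfs/bin)
    is on the PATH of the client node; it filters the output of
    'mmdiag --waiters' for the pattern shown above:

      #!/usr/bin/env python3
      # Minimal sketch: list GPFS waiters stuck on NSD I/O
      # completion. Run on the affected client node.
      import subprocess
      import sys

      def nsd_io_waiters():
          # 'mmdiag --waiters' lists threads currently waiting
          # inside the GPFS daemon.
          out = subprocess.run(["mmdiag", "--waiters"],
                               capture_output=True, text=True,
                               check=True).stdout
          # Keep only waiters of the kind quoted above.
          return [line for line in out.splitlines()
                  if "for NSD I/O completion on node" in line]

      if __name__ == "__main__":
          hung = nsd_io_waiters()
          print("\n".join(hung))
          # Exit non-zero when such waiters exist, so the script
          # can double as a simple health probe.
          sys.exit(1 if hung else 0)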
    

Local fix

Problem summary

  • There are two sets of ESS building blocks and GPFS replication
    is enabled. The expectation is that an outage of only one ESS
    building block will not block or crash the whole file system.
    However, when both ESS servers of one building block are shut
    down simultaneously, the system hangs: the remote NSD clients
    stay in a long retry loop waiting for the already-down ESS
    server pair to become ready again, which is unreasonable.
    Analysis shows that the NSD clients in the home cluster can
    break out of this loop without entering the RGCM query,
    because they hold a list of the statically defined ESS
    servers. This list is not passed to the remote NSD clients,
    however, so they fall into the RGCM query and remain waiting
    in the loop.
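
    The loop described above can be modeled with a small Python
    sketch. The names (pick_nsd_server, is_up, query_rgcm, the
    retry parameters) are hypothetical illustrations of this text,
    not the actual Spectrum Scale implementation:

      import time

      def pick_nsd_server(static_servers, is_up, query_rgcm,
                          retry_secs=5, max_retries=3):
          # Home-cluster path: with the statically defined ESS
          # server list a client can see that every server is down
          # and give up at once instead of looping.
          if static_servers:
              up = [s for s in static_servers if is_up(s)]
              return up[0] if up else None
          # Remote-client path before the fix: the list arrives
          # empty, so the client keeps querying RGCM for a ready
          # server and its file system I/O hangs for as long as
          # both ESS IO nodes stay down.
          for _ in range(max_retries):
              server = query_rgcm()
              if server is not None:
                  return server
              time.sleep(retry_secs)
          return None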
    

Problem conclusion

  • This issue was introduced by commit d0f26a09. The fix adjusts
    the behavior slightly: for statically defined ESS servers, the
    server list is now passed to the remote clients as well; for a
    dynamically selected Mestor vdisk server, the current behavior
    of passing a null server list is kept. This allows the remote
    clients and the home cluster clients to behave the same way.
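
    A rough model of the fixed decision, again with hypothetical
    names rather than the actual Spectrum Scale code:

      def server_list_for_remote_client(servers, statically_defined):
          # Statically defined ESS servers: propagate the list so
          # remote clients can break out of the retry loop exactly
          # like home-cluster clients do.
          # Dynamically selected Mestor vdisk server: keep the
          # current behavior and send a null (None) list.
          return list(servers) if statically_defined else None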
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ03154

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    500

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2018-01-10

  • Closed date

    2018-01-10

  • Last modified date

    2018-04-05

  • APAR is sysrouted FROM one or more of the following:

    IJ03098

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels

  • R500 PSY U880930

       18/04/05 I 1000

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY","label":"IBM Spectrum Scale"},"Component":"","ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"500","Edition":"","Line of Business":{"code":"LOB26","label":"Storage"}}]
