APAR status
Closed as program error.
Error description
When both ESS IO nodes of the same building block are shut down, an NSD client in a remote cluster may see file system operations hang for a long time, even if the file system has more than one replica defined and at least one replica is still available.
Reported in: Spectrum Scale 4.2.2.3 on RHEL7
Known Impact: File system access hangs for a long time (by default up to 82 minutes), or until the ESS IO node is started again.
Verification steps:
1. Shut down both ESS IO nodes of the same building block.
2. Access the file system from a client node in a remote cluster.
3. The access hangs.
4. Check the GPFS waiters on the client node. There are threads waiting for NSD I/O completion on node <none> <none>, similar to the following:
Waiting 2.1923 sec since 19:23:24, monitored, thread 9000 xxxxxxThread: for NSD I/O completion on node <none> <none>
Recovery action: Bring up the ESS IO nodes.
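The waiter check in verification step 4 can be automated with a small filter. The following is a minimal sketch in Python, not an IBM tool: the function name find_stuck_nsd_waiters and the regular expression are assumptions modeled only on the single sample waiter line shown above, so real waiter output may need a different pattern.

```python
import re

# Assumed pattern, derived only from the sample waiter line in this APAR.
# Real waiter output may differ between releases.
WAITER_RE = re.compile(
    r"Waiting\s+(?P<secs>\d+(?:\.\d+)?)\s+sec.*?"
    r"thread\s+(?P<thread>\d+)\s+(?P<name>[^:]+):\s+"
    r"for NSD I/O completion on node\s+<none>"
)


def find_stuck_nsd_waiters(text, min_seconds=0.0):
    """Return (seconds, thread_id, thread_name) tuples for waiters that are
    blocked on NSD I/O completion against an unresolved (<none>) node."""
    hits = []
    for line in text.splitlines():
        match = WAITER_RE.search(line)
        if match is None:
            continue
        seconds = float(match.group("secs"))
        if seconds >= min_seconds:
            hits.append((seconds, int(match.group("thread")), match.group("name")))
    return hits


# Sample line copied from the error description above.
sample = (
    "Waiting 2.1923 sec since 19:23:24, monitored, thread 9000 "
    "xxxxxxThread: for NSD I/O completion on node <none> <none>"
)

for seconds, thread_id, name in find_stuck_nsd_waiters(sample):
    print(f"thread {thread_id} ({name}) waiting {seconds} sec on node <none>")
```

The input text could be captured on the client node, for example from the output of mmdiag --waiters. Raising min_seconds helps to ignore short-lived waiters that are not part of this hang.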
Local fix
Problem summary
There are two sets of ESS building blocks and GPFS replication is enabled. The expectation is that an outage of only one ESS building block will not block or crash the whole file system. However, when both ESS servers of a building block are shut down simultaneously, the system hangs and the remote NSD clients stay in a long retry loop, waiting for stateful server readiness of the ESS server pair that is already down, which is unreasonable. Analysis shows that NSD clients in the home cluster can break out of the loop without going into an RGCM query by using a list of statically defined ESS servers. However, this list is not passed to the remote NSD clients, which results in the RGCM query and the wait loop.
Problem conclusion
This issue was introduced by commit d0f26a09. The fix changes the behavior slightly. For statically defined ESS servers, the server list is still passed to the remote client. For a dynamically selected Mestor vdisk server, the current behavior of passing a null server list is kept. This allows remote clients and home cluster clients to behave the same way.
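The decision described above can be illustrated with a minimal sketch in Python. This is not the product code: NsdInfo, server_list_for_client and the with_fix flag are hypothetical names introduced only for illustration, and the handling of home cluster clients is inferred from the problem summary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NsdInfo:
    """Hypothetical NSD description; static_servers is empty when the
    serving vdisk server is selected dynamically."""
    name: str
    static_servers: List[str] = field(default_factory=list)


def server_list_for_client(
    nsd: NsdInfo, client_is_remote: bool, with_fix: bool = True
) -> Optional[List[str]]:
    """Return the server list handed to an NSD client, or None for a null list."""
    if not nsd.static_servers:
        # Dynamically selected server: a null list is passed, before and after the fix.
        return None
    if client_is_remote and not with_fix:
        # Behavior described in the problem summary: remote clients do not get
        # the static list, so they fall into the RGCM query and retry loop.
        return None
    # Statically defined servers: home and (with the fix) remote clients
    # receive the same list.
    return list(nsd.static_servers)


# Hypothetical server names, used only to exercise the sketch.
static_nsd = NsdInfo("nsd1", ["ess-io1", "ess-io2"])
print(server_list_for_client(static_nsd, client_is_remote=True, with_fix=False))
print(server_list_for_client(static_nsd, client_is_remote=True, with_fix=True))
```

With the fix, a remote client and a home cluster client get identical results for every NSD, which is the stated goal of the change.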
Temporary fix
Comments
APAR Information
APAR number
IJ03154
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
500
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2018-01-10
Closed date
2018-01-10
Last modified date
2018-04-05
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP
Applicable component levels
R500 PSY U880930
18/04/05 I 1000
Document Information
Modified date:
05 April 2018