IJ47220: GNR DEADLOCK WHILE WAITING FOR DISK AVAILABILITY TO STABILIZE

APAR status

Closed as program error.

Error description

A race condition in the distributed GNR disk hospital can cause
a state update from the master node to a worker node to be
rejected.

When the master node wishes to release a disk from the
"diagnosing" state to the "ok" state, it sends a state broadcast
to all worker nodes to instruct them to reflect the pdisk's new
master state locally.

However, this broadcast can race with additional disk problem
reports that are transmitted from the worker to the master.

The result is that the worker node can reject the master's claim
that the disk is healthy, and continue to hold the disk in the
"diagnosing" state.

This can lead to blocked file system I/O unless another state
change notification is broadcast from the master, in which
case the worker gets another chance to resume I/O to the disk.
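
The race described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not GNR code: the class NaiveWorker, its methods, and the rule "reject ok while any problem report is unacknowledged" are hypothetical and were chosen only to show how a reordered acknowledgement and state broadcast can leave a disk held in "diagnosing".

```python
from dataclasses import dataclass, field


@dataclass
class NaiveWorker:
    """Toy worker-side view of one pdisk (hypothetical, not GNR code)."""
    state: str = "diagnosing"
    unacked_reports: set = field(default_factory=set)
    next_report_id: int = 1

    def report_problem(self) -> int:
        """Send a problem report to the master; it stays pending until acked."""
        report_id = self.next_report_id
        self.next_report_id += 1
        self.unacked_reports.add(report_id)
        self.state = "diagnosing"
        return report_id

    def on_ack(self, report_id: int) -> None:
        """The master acknowledged a problem report."""
        self.unacked_reports.discard(report_id)

    def on_state_broadcast(self, new_state: str) -> bool:
        """Apply the master's state unless local reports look unprocessed."""
        if new_state == "ok" and self.unacked_reports:
            # Assumed rule: the worker thinks the master decided on stale data.
            return False
        self.state = new_state
        return True


def run(events):
    """Replay one problem report followed by master messages in a given order."""
    worker = NaiveWorker()
    report_id = worker.report_problem()
    for event in events:
        if event == "ack":
            worker.on_ack(report_id)
        else:
            worker.on_state_broadcast("ok")
    return worker.state


in_order = run(["ack", "ok"])    # ack arrives first: the disk is released
reordered = run(["ok", "ack"])   # broadcast arrives first: the disk stays held
print(in_order, reordered)
```

In the reordered run nothing prompts the worker to re-evaluate after the late acknowledgement, which matches the symptom of I/O staying blocked until the master broadcasts again.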

Local fix

Restarting the daemon on the nodes with the waiter "Until disk
availability stabilizes" can clear out the waiters.
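
As a rough aid for finding the nodes to restart, the following sketch filters per-node waiter text for the waiter named above. It assumes the waiter dumps have already been collected as plain text (for example from `mmdiag --waiters` on each node, which is an assumption about the workflow); the function name, the dictionary layout, and the sample strings are hypothetical placeholders and do not reproduce real command output.

```python
WAITER_TEXT = "Until disk availability stabilizes"


def nodes_with_stuck_waiter(waiters_by_node):
    """Return the sorted names of nodes whose waiter text mentions WAITER_TEXT.

    waiters_by_node maps a node name to the raw waiter text collected from
    that node. The match is a case-insensitive substring search, so the
    exact output format of the collection tool does not matter here.
    """
    needle = WAITER_TEXT.lower()
    return sorted(
        node for node, dump in waiters_by_node.items()
        if needle in dump.lower()
    )


# Placeholder input, not real tool output.
sample = {
    "node-a": "placeholder waiter line: Until disk availability stabilizes",
    "node-b": "placeholder waiter line: some unrelated waiter",
    "node-c": "",
}
candidates = nodes_with_stuck_waiter(sample)
print(candidates)
```

The returned list is only a set of candidates; whether and when to restart the daemon on those nodes remains an operational decision.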

Problem summary

A race condition in the distributed GNR disk hospital can cause
a state update from the master node to a worker node to be
rejected.

When the master node wishes to release a disk from the
"diagnosing" state to the "ok" state, it sends a state broadcast
to all worker nodes to instruct them to reflect the pdisk's new
master state locally.

However, this broadcast can race with additional disk problem
reports that are transmitted from the worker to the master.

The result is that the worker node can reject the master's claim
that the disk is healthy, and continue to hold the disk in the
"diagnosing" state.

This can lead to blocked file system I/O unless another state
change notification is broadcast from the master, in which
case the worker gets another chance to resume I/O to the disk.

Problem conclusion

This problem is fixed in 5.1.2.12.
To see all Spectrum Scale APARs and their respective
fix solutions, refer to this page:
https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html

Benefits of the solution:
The code now properly handles out-of-order replies on the
worker side, so the worker can reason about the sequence of
events on the master side even if individual state updates and
problem report acknowledgements are received out of order.
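
One way to picture out-of-order handling on the worker side is with sequence numbers. The sketch below is an illustration under stated assumptions, not the actual fix: the class SequencedWorker, the master_seq counter, and the reports_seen field (the highest problem report the master had processed when it sent the broadcast) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SequencedWorker:
    """Toy worker that orders master messages by content, not arrival time."""
    state: str = "diagnosing"
    last_report_sent: int = 0
    last_report_acked: int = 0
    last_master_seq: int = 0

    def report_problem(self) -> int:
        """Send a numbered problem report to the master."""
        self.last_report_sent += 1
        self.state = "diagnosing"
        return self.last_report_sent

    def on_ack(self, report_id: int) -> None:
        """Record an acknowledgement; a late ack never moves state backwards."""
        self.last_report_acked = max(self.last_report_acked, report_id)

    def on_state_broadcast(self, master_seq: int, new_state: str,
                           reports_seen: int) -> bool:
        """Apply a broadcast only if it reflects every report sent so far."""
        if master_seq <= self.last_master_seq:
            return False  # stale or duplicate broadcast
        self.last_master_seq = master_seq
        # The broadcast itself shows which reports the master had processed,
        # so it counts as an implicit acknowledgement.
        self.last_report_acked = max(self.last_report_acked, reports_seen)
        if new_state == "ok" and reports_seen < self.last_report_sent:
            # The master decided before seeing the latest report. This toy
            # model assumes it broadcasts again after processing that report.
            return False
        self.state = new_state
        return True


# The broadcast overtakes the acknowledgement: still accepted.
worker = SequencedWorker()
first = worker.report_problem()
accepted_early = worker.on_state_broadcast(master_seq=1, new_state="ok",
                                           reports_seen=first)
worker.on_ack(first)  # the late ack changes nothing
state_after_reorder = worker.state

# The master has not yet seen the second report: rejected, then accepted.
busy = SequencedWorker()
busy.report_problem()
second = busy.report_problem()
rejected = busy.on_state_broadcast(master_seq=1, new_state="ok", reports_seen=1)
accepted_later = busy.on_state_broadcast(master_seq=2, new_state="ok",
                                         reports_seen=second)
print(state_after_reorder, rejected, accepted_later, busy.state)
```

Because each broadcast carries what the master knew when it decided, the worker no longer depends on acknowledgements arriving before state updates.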

Work Around:
Restarting the daemon on the nodes with the waiter "Until disk
availability stabilizes" can clear out the waiters.

Problem trigger:
This problem can potentially occur when any local I/O error is
encountered on a pdisk, but in general the race condition in
that path is rare. It is more likely to occur on Spectrum
Storage Scale Erasure Code edition during periods of network
instability when pdisks are likely to encounter many timeout
errors.

Symptom:
Stuck IO

Platforms affected:
Linux Only

Functional Area affected:
ESS/GNR

Customer Impact:
High Importance

Temporary fix

Comments

APAR Information

APAR number: IJ47220
Reported component name: SPEC SCALE STD
Reported component ID: 5737F33AP
Reported release: 512
Status: CLOSED PER
PE: NoPE
HIPER: NoHIPER
Special Attention: NoSpecatt / Xsystem
Submitted date: 2023-06-14
Closed date: 2023-07-19
Last modified date: 2023-07-19

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name: SPEC SCALE STD
Fixed component ID: 5737F33AP

Applicable component levels

Document Information

Modified date:
20 July 2023
