IBM Support

IJ47220: GNR DEADLOCK WHILE WAITING FOR DISK AVAILABILITY TO STABILIZE

APAR status

  • Closed as program error.

Error description

  • A race condition in the distributed GNR disk hospital can
    cause a state update from the master node to a worker node to be
    rejected.
    
    When the master node releases a pdisk from the "diagnosing"
    state to the "ok" state, it broadcasts the new state to all
    worker nodes so that each one reflects the pdisk's master
    state locally.
    
    However, this broadcast can race with additional disk problem
    reports that are transmitted from the worker to the master.
    
    As a result, the worker node can reject the master's claim
    that the disk is healthy and continue holding the disk in
    the "diagnosing" state.
    
    This can lead to blocked file system I/O unless the master
    broadcasts another state change notification, in which case
    the worker gets another chance to resume I/O to the disk.
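The race described above can be illustrated with a toy model. This is a minimal sketch and not GNR code: the class `NaiveWorker`, its methods, and the rule "reject ok while a report is unacknowledged" are assumptions chosen only to reproduce the described symptom.

```python
from dataclasses import dataclass


@dataclass
class NaiveWorker:
    """Hypothetical worker-side view of a single pdisk (illustration only)."""

    state: str = "diagnosing"
    unacked_reports: int = 0

    def report_problem(self) -> None:
        # The worker sends a problem report to the master and awaits an ack.
        self.unacked_reports += 1
        self.state = "diagnosing"

    def on_ack(self) -> None:
        # The master's acknowledgement of one problem report arrives.
        self.unacked_reports = max(0, self.unacked_reports - 1)

    def on_master_state(self, new_state: str) -> bool:
        # Assumed naive rule: distrust "ok" while a local report is unacked.
        if new_state == "ok" and self.unacked_reports > 0:
            return False  # broadcast rejected; the pdisk stays in diagnosing
        self.state = new_state
        return True


worker = NaiveWorker()
worker.report_problem()                  # report is in flight to the master
accepted = worker.on_master_state("ok")  # "ok" broadcast overtakes the ack
worker.on_ack()                          # the ack arrives too late to matter
print(accepted, worker.state)            # prints: False diagnosing
```

In this model the pdisk remains in "diagnosing" until `on_master_state` is called again, which corresponds to the master broadcasting another state change notification.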
    

Local fix

  • Restarting the daemon on the nodes with the waiter "Until disk
    availability stabilizes" can clear out the waiters.
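
To decide which nodes need a daemon restart, the waiter text from each node can be scanned for the reason string. The sketch below assumes the waiter output has already been collected per node as plain text (for example with `mmdiag --waiters`); the function name `nodes_to_restart` and the dictionary input are hypothetical, and no particular output layout is assumed beyond the reason string appearing somewhere in the text.

```python
WAITER_REASON = "Until disk availability stabilizes"


def nodes_to_restart(waiters_by_node: dict[str, str]) -> list[str]:
    """Return the sorted node names whose waiter text mentions WAITER_REASON.

    waiters_by_node maps a node name to the raw waiter text collected from
    that node. The match is a case-insensitive substring search.
    """
    needle = WAITER_REASON.lower()
    return sorted(
        node
        for node, text in waiters_by_node.items()
        if needle in text.lower()
    )
```

The returned list is only a candidate set; restarting the daemon interrupts I/O served by that node, so each restart should be reviewed before it is carried out.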
    

Problem summary

  • A race condition in the distributed GNR disk hospital can
    cause a state update from the master node to a worker node to be
    rejected.
    
    When the master node releases a pdisk from the "diagnosing"
    state to the "ok" state, it broadcasts the new state to all
    worker nodes so that each one reflects the pdisk's master
    state locally.
    
    However, this broadcast can race with additional disk problem
    reports that are transmitted from the worker to the master.
    
    As a result, the worker node can reject the master's claim
    that the disk is healthy and continue holding the disk in
    the "diagnosing" state.
    
    This can lead to blocked file system I/O unless the master
    broadcasts another state change notification, in which case
    the worker gets another chance to resume I/O to the disk.
    

Problem conclusion

  • This problem is fixed in 5.1.2.12.
    To see all Spectrum Scale APARs and their respective
    fix solutions, refer to:
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
    
    Benefits of the solution:
    The code now correctly handles out-of-order replies on the
    worker side, allowing the worker to reason about the sequence of
    events on the master side even if individual state updates and
    problem report acknowledgements are received out of order.
    
    Work Around:
    Restarting the daemon on the nodes with the waiter "Until disk
    availability stabilizes" can clear out the waiters.
    
    Problem trigger:
    This problem can occur whenever a local I/O error is
    encountered on a pdisk, although the race condition in that
    path is rare in general. It is more likely to occur on
    Spectrum Scale Erasure Code Edition during periods of network
    instability, when pdisks are likely to encounter many timeout
    errors.
    
    Symptom:
    Stuck I/O
    
    Platforms affected:
    Linux Only
    
    Functional Area affected:
    ESS/GNR
    
    Customer Impact:
    High Importance
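
The out-of-order handling described under "Benefits of the solution" can be illustrated with sequence numbers. This is a minimal sketch under stated assumptions, not the actual fix: it assumes the master stamps every state broadcast and every problem report acknowledgement with a single increasing counter, and the class `SequencedWorker` and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SequencedWorker:
    """Hypothetical worker that orders master events by sequence number."""

    state: str = "diagnosing"
    pending: set = field(default_factory=set)  # report ids awaiting an ack
    master_seq: int = -1                       # newest state broadcast seen
    master_state: str = "diagnosing"
    last_ack_seq: int = -1                     # newest ack sequence seen

    def report_problem(self, report_id: int) -> None:
        self.pending.add(report_id)
        self.state = "diagnosing"

    def on_ack(self, report_id: int, seq: int) -> None:
        self.pending.discard(report_id)
        self.last_ack_seq = max(self.last_ack_seq, seq)
        self._reconcile()

    def on_master_state(self, seq: int, new_state: str) -> None:
        # Remember the newest broadcast instead of rejecting it outright.
        if seq > self.master_seq:
            self.master_seq, self.master_state = seq, new_state
        self._reconcile()

    def _reconcile(self) -> None:
        # Adopt the master's state only if it was issued after the master
        # processed every report this worker sent.
        if not self.pending and self.master_seq > self.last_ack_seq:
            self.state = self.master_state


worker = SequencedWorker()
worker.report_problem(report_id=1)
worker.on_master_state(seq=2, new_state="ok")  # arrives before the ack
worker.on_ack(report_id=1, seq=1)              # ack was issued earlier
print(worker.state)                            # prints: ok
```

Because the early "ok" broadcast is remembered rather than dropped, the late acknowledgement is enough to release the pdisk, and no further broadcast from the master is needed. A broadcast whose sequence number is lower than that of an acknowledgement is treated as stale and is not applied.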
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ47220

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    512

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2023-06-14

  • Closed date

    2023-07-19

  • Last modified date

    2023-07-19

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels

  • Business unit: IBM Infrastructure w/TPS (BU058)
  • Product code: STXKQY
  • Platform: Platform Independent (PF025)
  • Version: 512
  • Line of business: Storage (LOB26)

Document Information

Modified date:
20 July 2023