IBM Support

IJ26348: LONG WAITER "WAIT FOR WORKING INDEX ENTRY TO BE COMMITTED"

APAR status

  • Closed as program error.

Error description

  • A race condition between RG master resign/recovery
    and the mdi operation on the worker side
    could trigger a bug in RG master recovery that
    leaves the working index (WI) entry
    stuck in the Assigned state.
    The integrity manager thread then blocks,
    and the long waiter "wait for working
    index entry to be committed" appears on the RG master
    node. While the WI entry is stuck, the next
    RG master resign/recovery event could lead
    to data integrity issues.
    
    Note: this can happen only in ECE/ESS3K
    environments, not in legacy ESS environments.
    

Local fix

  • The bug does not cause a problem
    right away, but it exposes a condition in which
    additional events could lead to a recovery failure.
    By the time the recovery failure is detected, data has
    already been corrupted. Special steps are
    required to allow recovery to
    complete, and a certain amount of data
    will be lost in the meantime.
    

Problem summary

  • A race condition between RG master resign/recovery
    and the mdi operation on the worker side
    could trigger a bug in RG master recovery that
    leaves the working index (WI) entry
    stuck in the Assigned state.
    The integrity manager thread then blocks,
    and the long waiter "wait for working
    index entry to be committed" appears on the RG master
    node. While the WI entry is stuck, the next
    RG master resign/recovery event could lead
    to data integrity issues.
    
    Note: this can happen only in ECE/ESS3K
    environments, not in legacy ESS environments.
    

Problem conclusion

  • Benefits of the solution:
    The code is fixed so that working index recovery
    is handled properly during recovery.
    Without the fix, data loss could occur.
    
    Work Around:
    The bug does not cause a problem
    right away, but it exposes a condition in which
    additional events could lead to a recovery failure.
    By the time the recovery failure is detected, data has
    already been corrupted. Special steps are
    required to allow recovery to
    complete, and a certain amount of data
    will be lost in the meantime.
    
    Problem trigger:
    This is normally exposed by a race condition
    in which an RG master resign/recovery event happens
    in the middle of the worker-side mdi
    operation, specifically when the worker's
    RPC that modifies the mdi state interleaves
    with the RG master's RPC that queries
    the mdi state.
    
    Symptom:
    The immediate effect is that a long waiter
    is observed on the RG master (root owner) node:
    "wait for working index entry to be committed".
    In this state, if the metadata block
    related to this working index is updated,
    RG recovery could roll back the
    metadata and cause
    data integrity issues. The other
    possible outcome is a recovery failure,
    reported with the message "MDI VCD
    recovery failure in build used metaslot bitmaps: 16".
    
    Note that this long waiter can also
    be caused by other situations, and not
    all cases will result in data integrity issues.
    
    Platforms affected: N/A
    
    Functional Area affected: GNR/ECE/ESS3K
    
    Customer Impact: High importance
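
The symptom described above can be checked for with a small script. The following is a minimal sketch, not an IBM-provided tool: it scans text lines (for example, saved output of `mmdiag --waiters` or a daemon log) for the two strings quoted in this APAR. The function names, the 60-second threshold, and the assumed "Waiting <seconds> sec" shape of a waiter line are illustrative assumptions and may need adjusting to the actual output on a given system.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Strings quoted verbatim in this APAR.
WAITER_REASON = "wait for working index entry to be committed"
RECOVERY_FAILURE = "MDI VCD recovery failure in build used metaslot bitmaps"

# ASSUMPTION: a waiter line contains "Waiting <seconds> sec".
_WAIT_SECONDS = re.compile(r"Waiting\s+([0-9]+(?:\.[0-9]+)?)\s+sec")


def find_long_waiters(
    lines: Iterable[str], min_seconds: float = 60.0
) -> List[Tuple[Optional[float], str]]:
    """Return (seconds, line) for each line that shows the APAR's waiter.

    Lines whose wait time cannot be parsed are reported with seconds=None
    so that they are not silently dropped.
    """
    hits: List[Tuple[Optional[float], str]] = []
    for line in lines:
        if WAITER_REASON not in line:
            continue
        match = _WAIT_SECONDS.search(line)
        if match is None:
            hits.append((None, line.rstrip()))
            continue
        seconds = float(match.group(1))
        if seconds >= min_seconds:
            hits.append((seconds, line.rstrip()))
    return hits


def has_recovery_failure(lines: Iterable[str]) -> bool:
    """Return True if any line contains the recovery failure message."""
    return any(RECOVERY_FAILURE in line for line in lines)
```

As the note above says, a match on the waiter string alone does not prove that this defect was hit, because the same long waiter can have other causes. A match is a reason to investigate, not a diagnosis.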
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ26348

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    505

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2020-07-22

  • Closed date

    2020-07-22

  • Last modified date

    2020-07-22

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels

  • Business Unit

    IBM Infrastructure w/TPS (BU058)

  • Product

    IBM Spectrum Scale (STXKQY)

  • Platform

    Platform Independent (PF025)

  • Version

    505

  • Line of Business

    Storage (LOB26)

Document Information

Modified date:
23 July 2020