IBM Support

IJ29812: RECOVERY GROUP RESIGNS IN A SMALL ECE SETUP IN THE PRESENCE OF MISSING DISKS BEFORE FAULT TOLERANCE IS REACHED.

APAR status

  • Closed as program error.

Error description

  • In IBM Spectrum Scale Erasure Code Edition (ECE), all of a
    server's pdisks (physical disks) can become missing because
    of a network failure, a node failure, or a planned "node
    suspend" maintenance procedure. When this happens, the
    system continues to function as long as sufficient fault
    tolerance remains. However, smaller configurations with
    fewer ECE nodes are exposed to a race condition in which a
    pdisk state change interrupts a system-wide descriptor
    update, causing the recovery group to resign. The problem
    is also more likely on small ESS configurations, such as
    the GS1 or GS2 enclosures.
    
    For both ESS and ECE, the symptom appears in mmfs.log as a
    pdisk state change followed almost immediately by a resign
    message that reports a VCD write failure, even though the
    system fault tolerance has not been exceeded:
    
    2020-12-01_19:01:36.696-0400: [D] Pdisk n004p005 of RG rg1 state changed from ok/00000.180 to missing/suspended/00050.180.
    2020-12-01_19:01:36.697-0400: [E] Beginning to resign recovery group rg1 due to "VCD write failure", caller err 217 when "updating VCD: RGD"
    
    Note that a "VCD write failure" with err 217 is a generic
    message issued when fault tolerance is exceeded during
    critical system updates. In this case, however, the race
    condition causes the recovery group to resign when only a
    handful of disks are missing.
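
The symptom described above can be checked for mechanically. The following is a minimal Python sketch, not an IBM-provided tool, that scans mmfs.log lines for a pdisk change to a "missing" state followed shortly by a "VCD write failure" err 217 resign of the same recovery group. The function name find_suspect_resigns, the one-second window, and the assumption that each message occupies a single line in the real log are illustrative choices, not taken from the product documentation. The script only flags candidates; it cannot tell whether fault tolerance was actually exceeded.

```python
import re
from datetime import datetime, timedelta

# Timestamp layout taken from the two sample messages in this APAR.
TS = r"(?P<ts>\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4})"

PDISK_RE = re.compile(
    TS + r": \[D\] Pdisk (?P<pdisk>\S+) of RG (?P<rg>\S+) "
    r"state changed from (?P<old>\S+) to (?P<new>\S+?)\.?$"
)
RESIGN_RE = re.compile(
    TS + r": \[E\] Beginning to resign recovery group (?P<rg>\S+) "
    r'due to "VCD write failure", caller err 217'
)


def parse_ts(text):
    """Parse a timestamp such as 2020-12-01_19:01:36.696-0400."""
    return datetime.strptime(text, "%Y-%m-%d_%H:%M:%S.%f%z")


def find_suspect_resigns(lines, window=timedelta(seconds=1)):
    """Return (rg, pdisk, delay) for each resign that follows a pdisk
    change to a missing state within `window` (an assumed threshold)."""
    last_missing = {}  # recovery group -> (timestamp, pdisk name)
    hits = []
    for raw in lines:
        line = raw.strip()
        m = PDISK_RE.match(line)
        if m:
            if m.group("new").startswith("missing"):
                last_missing[m.group("rg")] = (parse_ts(m.group("ts")),
                                               m.group("pdisk"))
            continue
        m = RESIGN_RE.match(line)
        if m and m.group("rg") in last_missing:
            changed_at, pdisk = last_missing[m.group("rg")]
            delay = parse_ts(m.group("ts")) - changed_at
            if timedelta(0) <= delay <= window:
                hits.append((m.group("rg"), pdisk, delay))
    return hits
```

To use it, pass the function an open log file or any iterable of lines, for example find_suspect_resigns(open(path)) with path pointing at a copy of mmfs.log. Each reported tuple names the recovery group, the pdisk whose state changed, and the delay between the two messages.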
    

Local fix

  • The workaround is to allow the system to resign and recover
    on its own. For smaller ECE configurations, avoid leaving a
    node suspended for long periods of time during maintenance
    tasks.
    

Problem summary

  • In IBM Spectrum Scale Erasure Code Edition (ECE), all of a
    server's pdisks (physical disks) can become missing because
    of a network failure, a node failure, or a planned "node
    suspend" maintenance procedure. When this happens, the
    system continues to function as long as sufficient fault
    tolerance remains. However, smaller configurations with
    fewer ECE nodes are exposed to a race condition in which a
    pdisk state change interrupts a system-wide descriptor
    update, causing the recovery group to resign. The problem
    is also more likely on small ESS configurations, such as
    the GS1 or GS2 enclosures.
    
    For both ESS and ECE, the symptom appears in mmfs.log as a
    pdisk state change followed almost immediately by a resign
    message that reports a VCD write failure, even though the
    system fault tolerance has not been exceeded:
    
    2020-12-01_19:01:36.696-0400: [D] Pdisk n004p005 of RG rg1 state changed from ok/00000.180 to missing/suspended/00050.180.
    2020-12-01_19:01:36.697-0400: [E] Beginning to resign recovery group rg1 due to "VCD write failure", caller err 217 when "updating VCD: RGD"
    
    Note that a "VCD write failure" with err 217 is a generic
    message issued when fault tolerance is exceeded during
    critical system updates. In this case, however, the race
    condition causes the recovery group to resign when only a
    handful of disks are missing.
    

Problem conclusion

  • Benefits of the solution:
    The code fix eliminates the race condition, making the
    system more stable in the presence of missing disks.
    
    Work Around:
    The workaround is to allow the system to resign and recover
    on its own. For smaller ECE configurations, avoid leaving a
    node suspended for long periods of time during maintenance
    tasks.
    
    Problem trigger:
    pdisk state updates that coincide with system-wide
    descriptor updates on configurations with smaller recovery
    groups, such as a small number of nodes on ECE or a small
    enclosure on ESS.
    
    Symptom:
    Unexpected Results/Behavior
    
    Platforms affected:
    N/A
    
    Functional Area affected:
    ESS/GNR
    
    Customer Impact:
    Suggested
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ29812

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    510

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2020-12-11

  • Closed date

    2020-12-11

  • Last modified date

    2020-12-11

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    IJ29910

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels

  • Business Unit

    IBM Infrastructure w/TPS (BU058)

  • Product

    IBM Spectrum Scale (STXKQY)

  • Platform

    Platform Independent (PF025)

  • Version

    510

  • Line of Business

    Storage (LOB26)

Document Information

Modified date:
16 December 2020