IBM Support

IJ49862: READMIT FAILED TO FIX STALE STRIPS WITH PDISK TRANSIENT PATHWAIT


APAR status

  • Closed as program error.

Error description

  • ABSTRACT:
    Readmit failed to fix stale strips with pdisk transient pathWait
    
    Error Description:
    The rebalance/readmit operation failed to fix all stale strips
    when the pdisk state is in transient pathWait.
    
    Reported in:
    Spectrum Scale 5.1.7.1
    
    Known Impact:
    Stale strips are left on some vtracks. If new pdisk failures
    occur, vtrack reconstruction will fail to fix these vtracks,
    causing access errors.
    
    Verification steps:
    
    If the DA is already in scrub state, run this command on each
    ECE I/O node:
    mmfsadm dump vdisk | grep 'vQueue. pQueue.' | grep -iv "count 0"
    If it shows non-zero counts, then you have hit this problem.
    
    Recovery action:
    N/A
    
    Local Fix:
    N/A
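
The verification step above can be wrapped in a small script so that the result is easier to read on each ECE I/O node. This is only a sketch: the function names filter_nonzero_queues and check_stale_queues are hypothetical, the grep patterns are copied from the verification step, and the exact layout of the "mmfsadm dump vdisk" output is not shown in this APAR, so the filter is not adapted beyond what is stated there.

```shell
#!/usr/bin/env bash

# Hypothetical helper: apply the two grep filters from the verification
# step to whatever text arrives on stdin.
filter_nonzero_queues() {
    grep 'vQueue. pQueue.' | grep -iv "count 0"
}

# Hypothetical wrapper: run the dump on the local node and report.
# Returns 0 if nothing is left after filtering, 1 if lines with
# non-zero counts remain, 2 if mmfsadm cannot be found.
check_stale_queues() {
    local hits
    if ! command -v mmfsadm >/dev/null 2>&1; then
        echo "mmfsadm not found in PATH" >&2
        return 2
    fi
    hits="$(mmfsadm dump vdisk 2>/dev/null | filter_nonzero_queues)"
    if [ -n "$hits" ]; then
        printf 'Non-zero queue counts on %s (possible IJ49862):\n%s\n' \
            "$(hostname)" "$hits"
        return 1
    fi
    printf 'No non-zero queue counts on %s\n' "$(hostname)"
    return 0
}
```

Source the script and call check_stale_queues on each ECE I/O node once the DA is in scrub state. A return code of 1 corresponds to the "non-zero counts" condition described in the verification steps.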
    

Local fix

Problem summary

  • When the daemon restarts on a worker node, a race condition can
    cause the worker's local state change to take place after GNR's
    readmit operation, which is intended to repair tracks with stale
    data. The delayed state change can cause the readmit operation
    to fail to repair the data on the given disks, leaving stale
    sectors in tracks that could have been fixed once the delayed
    state change took place. If more disks fail before the next
    cycle of scan and repair operations has a chance to repair these
    vtracks, data loss can result if the number of faults exceeds
    the fault tolerance of the vdisk.
    

Problem conclusion

  • This problem is fixed in 5.1.9.5.
    To see all Spectrum Scale APARs and their respective
    fix solutions, refer to:
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
    
    Benefits of the solution:
    The code was fixed to avoid the race condition.
    
    Work Around:
    Before the fix is installed, manually verify whether any
    vtracks are stuck in a stale state.
    
    Problem trigger:
    Daemon restart on an individual ECE node, or on a shared ESS
    node (although much less likely), followed by more failing disks.
    
    Symptom:
    Daemon crash
    
    Platforms affected:
    All
    
    
    Functional Area affected:
    GNR
    
    Customer Impact:
    High Importance
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ49862

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    517

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2024-01-26

  • Closed date

    2024-07-22

  • Last modified date

    2024-07-22

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels


Document Information

Modified date:
23 July 2024