IBM Support

IJ42511: MEDIA ERROR ON ESS NVME DEVICE MAY CAUSE A PDISK TO GO MISSING

APAR status

  • Closed as program error.

Error description

  • When an NVMe device becomes active, ESS must poll it to
    determine whether it is ready for I/O, because devices
    become visible to the OS before they can handle
    read/write requests. ESS does this by reading the final
    LBA of the device to see whether reads are allowed.
    The original implementation, however, incorrectly
    treated a media error on the final LBA as a sign that
    the device was not ready.
    As a result, a legitimate media problem on the final LBA
    of an NVMe device can cause ESS to declare the entire
    device unavailable.
    This problem can be identified by an NVMe pdisk going
    missing after unrecovered read errors appear in the
    Spectrum Scale RAID recovery group event log
    (mmvdisk recoverygroup list --events).
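
    The final-LBA read described above can be sketched as follows.
    This is a minimal illustration, not the ESS implementation: the
    names final_lba and probe_final_block, the default block size,
    and the use of a plain buffered read are all assumptions made
    for the example.

```python
import os


def final_lba(size_bytes: int, block_size: int) -> int:
    """Return the index of the last whole logical block (hypothetical helper)."""
    if block_size <= 0 or size_bytes < block_size:
        raise ValueError("device is smaller than one logical block")
    return size_bytes // block_size - 1


def probe_final_block(path: str, block_size: int = 4096) -> bytes:
    """Read the final logical block of a device or file (hypothetical helper).

    An OSError raised here only says that the read failed. It does not say
    whether the device is still starting up or the block itself is bad.
    A real probe would also bypass the page cache (for example with
    O_DIRECT); that is omitted to keep the sketch short.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end reports the size for regular files and,
        # on Linux, for block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = final_lba(size, block_size) * block_size
        return os.pread(fd, block_size, offset)
    finally:
        os.close(fd)
```

    A caller that treats every failure of such a read as "not ready"
    reproduces the behavior this APAR describes, because a media
    error on that one block is indistinguishable from a device that
    is still initializing.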
    

Local fix

Problem summary

  • A legitimate media error on the final LBA of an NVMe
    device causes the ESS readiness poll to report the
    entire device as unavailable, and the NVMe pdisk can go
    missing. Unrecovered read errors in the Spectrum Scale
    RAID recovery group event log precede the missing pdisk.
    

Problem conclusion

  • This problem is fixed in 5.1.2 PTF 7.
    To see all Spectrum Scale APARs and their respective
    fix solutions, refer to
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
    
    
    Benefits of the solution: The code now checks for more
    specific NVMe error conditions to determine whether the
    device is ready.
    
    Work Around: None
    Problem trigger: Corrupted physical block mapped to the
    final logical block within an NVMe namespace.
    Symptom: Component Level Outage
    Platforms affected: Linux Only
    Functional Area affected: ESS/GNR
    Customer Impact: High Importance
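
    The idea of checking for more specific NVMe error conditions
    can be illustrated with the sketch below. It is not the code
    shipped in the fix, and the APAR does not say which conditions
    the fix examines. The status values are taken from the NVMe
    base specification (Namespace Not Ready and Unrecovered Read
    Error); classify_probe, device_is_ready, and ProbeResult are
    hypothetical names.

```python
from enum import Enum

# NVMe completion status fields, per the NVMe base specification.
SCT_GENERIC = 0x0             # Status Code Type: generic command status
SCT_MEDIA = 0x2               # Status Code Type: media and data integrity errors
SC_SUCCESS = 0x00             # generic: successful completion
SC_NAMESPACE_NOT_READY = 0x82 # generic: namespace not ready
SC_UNRECOVERED_READ = 0x81    # media: unrecovered read error


class ProbeResult(Enum):
    READY = "ready"
    NOT_READY = "not ready"
    MEDIA_ERROR = "media error"
    OTHER_ERROR = "other error"


def classify_probe(sct: int, sc: int) -> ProbeResult:
    """Classify the completion status of the final-LBA read (hypothetical)."""
    if sct == SCT_GENERIC and sc == SC_SUCCESS:
        return ProbeResult.READY
    if sct == SCT_GENERIC and sc == SC_NAMESPACE_NOT_READY:
        return ProbeResult.NOT_READY
    if sct == SCT_MEDIA:
        # The device processed the read and reported a bad block.
        return ProbeResult.MEDIA_ERROR
    return ProbeResult.OTHER_ERROR


def device_is_ready(sct: int, sc: int) -> bool:
    """Assumed policy: a media error means the device is up and answering I/O,
    so only the affected block is suspect, not the whole device."""
    return classify_probe(sct, sc) in (ProbeResult.READY, ProbeResult.MEDIA_ERROR)
```

    Under this assumed policy, device_is_ready(SCT_MEDIA,
    SC_UNRECOVERED_READ) is true while
    device_is_ready(SCT_GENERIC, SC_NAMESPACE_NOT_READY) is false,
    so a bad final block no longer makes the device look
    unavailable.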
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ42511

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    512

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2022-09-26

  • Closed date

    2022-09-26

  • Last modified date

    2022-09-26

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels


Document Information

Modified date:
26 September 2022