IBM Support

IJ47466: ESS3K PEMS GOING OFFLINE AND POSSIBLE KERNEL CRASH IF A SCSI RESCAN IS RUN WHILE PEMS IS OFFLINE.

APAR status

  • Closed as program error.

Error description

  • ESS 3K customers running ESS 6.1.5 and later are seeing PEMS go
    offline, which leaves PEMS hung. If the customer runs scsi-rescan
    while PEMS is offline, the kernel may crash at pemsSlaveDestroy+0x46.
    

Local fix

  • If PEMS goes offline, do not run scsi-rescan, because it can
    trigger the crash. To recover PEMS from the offline state, restart
    the pemsmod module. Use this workaround until the fix is applied.
    
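The workaround above amounts to checking whether a device is offline before rescanning. The Python sketch below is a hypothetical guard, not an IBM tool: it assumes the PEMS enclosure appears as a Linux SCSI device whose sysfs `state` attribute reads `offline` (the dmesg sample in this APAR shows `11:0:0:0` being offlined). It does not identify which device belongs to PEMS, and it does not restart pemsmod.

```python
"""Hypothetical pre-rescan guard for APAR IJ47466 (not an IBM tool).

Assumption: the PEMS enclosure is visible as a SCSI device under
/sys/class/scsi_device/<H:C:T:L>/device/state, like 11:0:0:0 in the
dmesg sample quoted in this APAR.
"""
from __future__ import annotations

from pathlib import Path

SYSFS_SCSI = Path("/sys/class/scsi_device")


def offline_scsi_devices(sysfs_root: Path = SYSFS_SCSI) -> list[str]:
    """Return the H:C:T:L addresses whose sysfs state reads 'offline'."""
    offline: list[str] = []
    if not sysfs_root.is_dir():
        return offline
    for dev in sorted(sysfs_root.iterdir()):
        state_file = dev / "device" / "state"
        try:
            state = state_file.read_text().strip()
        except OSError:
            continue  # device vanished or state unreadable; skip it
        if state == "offline":
            offline.append(dev.name)
    return offline


def safe_to_rescan(sysfs_root: Path = SYSFS_SCSI) -> bool:
    """Treat a rescan as safe only when no SCSI device is offline."""
    return not offline_scsi_devices(sysfs_root)


if __name__ == "__main__":
    bad = offline_scsi_devices()
    if bad:
        print("Do not run scsi-rescan; offline devices:", ", ".join(bad))
    else:
        print("No offline SCSI devices found.")
```

Checking every SCSI device rather than a specific address keeps the sketch conservative: any offline device blocks the rescan, at the cost of possible false alarms from unrelated hardware.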

Problem summary

  • ESS 3K customers running ESS 6.1.5 and later are seeing PEMS go
    offline, which leaves PEMS hung. If the customer runs scsi-rescan
    while PEMS is offline, the kernel may crash at pemsSlaveDestroy+0x46.
    
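The exposure window described in this APAR can be expressed as a version comparison. This is a hypothetical sketch: it assumes the affected population is ESS 3K at ESS 6.1.5 or later, and that the fix is present once the installed Spectrum Scale level is 5.1.8.1 or later (the level named in the Problem conclusion). How the two version strings are obtained on a given system is left out.

```python
"""Hypothetical exposure check for APAR IJ47466.

Assumptions: ESS 3K at ESS 6.1.5 or later is affected, and the problem
is resolved at Spectrum Scale 5.1.8.1 or later. Versions are plain
dotted integers such as "6.1.5" or "5.1.8.1".
"""
from __future__ import annotations

FIRST_AFFECTED_ESS = (6, 1, 5)
FIXED_SCALE_LEVEL = (5, 1, 8, 1)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '5.1.8.1' into (5, 1, 8, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def _at_least(version: tuple[int, ...], floor: tuple[int, ...]) -> bool:
    """Compare after zero-padding, so 6.1.5 and 6.1.5.0 are equal."""
    width = max(len(version), len(floor))
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(version) >= pad(floor)


def is_exposed(ess_version: str, scale_version: str) -> bool:
    """True if the ESS level is affected and the Scale level lacks the fix."""
    ess_affected = _at_least(parse_version(ess_version), FIRST_AFFECTED_ESS)
    scale_fixed = _at_least(parse_version(scale_version), FIXED_SCALE_LEVEL)
    return ess_affected and not scale_fixed
```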

Problem conclusion

  • This problem is fixed in 5.1.8.1.
    To see all Spectrum Scale APARs and their respective fix
    solutions, refer to:
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
    
    Benefits of the solution:
    This solution fixes a potential hang in pemsCliCmdQueue and the
    crash at pemsSlaveDestroy+0x46 that occurs if scsi-rescan is run
    while PEMS is offline.
    
    Work Around:
    If PEMS goes offline, do not run scsi-rescan, because it can
    trigger the crash. To recover PEMS from the offline state, restart
    the pemsmod module. Use this workaround until the fix is applied.
    
    Problem trigger:
    It is not yet known why this issue is triggered on ESS 6.1.5 and
    later. The problem has not been recreated in the lab.
    
    Symptom:
    When PEMS goes offline, some commands, such as tsplatformstat,
    sginfo, and others, fail. Messages like the following appear in
    dmesg:
      scsi 11:0:0:0: Device offlined - not ready after error recovery
      scsi 11:0:0:0: rejecting I/O to offline device
      INFO: task pemsCliCmdQueue:11928 blocked for more than 120 seconds.
    If scsi-rescan is run in this state, the crash stack trace
    includes pemsSlaveDestroy+0x46.
    
    Functional Area affected:
    ESS 3K
    
    Customer Impact:
    High importance
    
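The dmesg messages listed under Symptom can be searched for mechanically. The sketch below is a hypothetical helper, not an IBM tool; its regular expressions are derived only from the sample lines in this APAR, and SCSI addresses, task PIDs, and timeouts will differ on other systems.

```python
"""Hypothetical symptom scanner for APAR IJ47466.

The patterns are assumptions based solely on the sample kernel
messages quoted in the Symptom section of this APAR.
"""
from __future__ import annotations

import re

SYMPTOM_PATTERNS = {
    "device_offlined": re.compile(
        r"scsi (\d+:\d+:\d+:\d+): Device offlined - not ready after error recovery"),
    "io_rejected": re.compile(
        r"scsi (\d+:\d+:\d+:\d+): rejecting I/O to offline device"),
    "pems_hung_task": re.compile(
        r"INFO: task pemsCliCmdQueue:(\d+) blocked for more than (\d+) seconds"),
    "pems_crash_frame": re.compile(r"pemsSlaveDestroy\+0x46"),
}


def scan_kernel_log(text: str) -> dict[str, list[str]]:
    """Map each symptom name to the log lines that matched it."""
    hits: dict[str, list[str]] = {name: [] for name in SYMPTOM_PATTERNS}
    for line in text.splitlines():
        for name, pattern in SYMPTOM_PATTERNS.items():
            # search() tolerates the timestamp prefix dmesg usually adds
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits
```

For example, the text printed by `dmesg` can be passed to `scan_kernel_log`; a non-empty `pems_hung_task` or `device_offlined` entry suggests PEMS is offline, in which case scsi-rescan should be avoided.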

Temporary fix

Comments

APAR Information

  • APAR number

    IJ47466

  • Reported component name

    SPEC SCALE DME

  • Reported component ID

    5737F34AP

  • Reported release

    518

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2023-07-05

  • Closed date

    2023-07-27

  • Last modified date

    2023-07-27

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE DME

  • Fixed component ID

    5737F34AP

Applicable component levels

  • Business unit

    BU058 (IBM Infrastructure w/TPS)

  • Product code

    STXKQY

  • Platform

    PF025 (Platform Independent)

  • Version

    518

  • Line of business

    LOB26 (Storage)

Document Information

Modified date:
27 July 2023