APAR status
Closed as program error.
Error description
When an NVMe device is becoming active, ESS must poll it to determine whether it is ready for I/O, because the device becomes visible to the OS before it can handle read/write requests. ESS does this by reading the final LBA of the device to see whether reads are allowed. The original implementation, however, incorrectly treated media errors on the final LBA as a sign that the device was not ready. As a result, a legitimate media problem on the final LBA of an NVMe device can cause ESS to report the entire device as unavailable. The problem can be identified by an NVMe pdisk going missing after unrecovered read errors appear in the Spectrum Scale RAID recovery group event log (mmvdisk recoverygroup list --events).
Local fix
Problem summary
When an NVMe device is becoming active, ESS must poll it to determine whether it is ready for I/O, because the device becomes visible to the OS before it can handle read/write requests. ESS does this by reading the final LBA of the device to see whether reads are allowed. The original implementation, however, incorrectly treated media errors on the final LBA as a sign that the device was not ready. As a result, a legitimate media problem on the final LBA of an NVMe device can cause ESS to report the entire device as unavailable. The problem can be identified by an NVMe pdisk going missing after unrecovered read errors appear in the Spectrum Scale RAID recovery group event log (mmvdisk recoverygroup list --events).
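The distinction between "not ready" and "ready, but the probed block is bad" can be illustrated with a minimal Python sketch. It is not the ESS implementation, and the APAR does not state which error conditions the fixed code checks. The sketch assumes the probe read returns an NVMe status code type (SCT) and status code (SC), and it uses the values the NVMe base specification defines for Namespace Not Ready (generic status 0x82) and for the Media and Data Integrity Errors type (0x2). The names classify_probe and ProbeResult are hypothetical.

```python
from enum import Enum

# Status values as defined in the NVMe base specification.
SCT_GENERIC = 0x0               # Generic Command Status
SCT_MEDIA = 0x2                 # Media and Data Integrity Errors
SC_SUCCESS = 0x00               # Successful Completion
SC_NAMESPACE_NOT_READY = 0x82   # Generic status: Namespace Not Ready


class ProbeResult(Enum):
    READY = "ready"
    NOT_READY = "not ready"
    READY_WITH_MEDIA_ERROR = "ready, media error on probed LBA"
    OTHER_ERROR = "other error"


def classify_probe(sct: int, sc: int) -> ProbeResult:
    """Classify the completion status of a read of the final LBA.

    Hypothetical helper: the mapping below is an assumption made for
    illustration, not the set of conditions ESS actually checks.
    """
    if sct == SCT_GENERIC and sc == SC_SUCCESS:
        return ProbeResult.READY
    if sct == SCT_GENERIC and sc == SC_NAMESPACE_NOT_READY:
        return ProbeResult.NOT_READY
    if sct == SCT_MEDIA:
        # The device processed the read, so it is accepting I/O;
        # only the probed block is damaged.
        return ProbeResult.READY_WITH_MEDIA_ERROR
    return ProbeResult.OTHER_ERROR


if __name__ == "__main__":
    # 0x81 in the media error type is Unrecovered Read Error.
    print(classify_probe(SCT_MEDIA, 0x81).value)
```

Under this assumed mapping, an unrecovered read error on the final LBA is reported as a media problem on that block rather than as an unavailable device, which is the behavior the original implementation lacked.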
Problem conclusion
This problem is fixed in 5.1.2 PTF 7.
To see all Spectrum Scale APARs and their respective fix solutions, refer to https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
Benefits of the solution: Code has been enhanced to check for more specific NVMe error conditions that can determine if the device is ready or not.
Work Around: None
Problem trigger: Corrupted physical block mapped to the final logical block within an NVMe namespace.
Symptom: Component Level Outage
Platforms affected: Linux Only
Functional Area affected: ESS/GNR
Customer Impact: High Importance
Temporary fix
Comments
APAR Information
APAR number
IJ42511
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
512
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2022-09-26
Closed date
2022-09-26
Last modified date
2022-09-26
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP
Applicable component levels
Document Information
Modified date:
26 September 2022