IBM Support

IBM Spectrum Scale Erasure Code Edition (ECE) Alert: Potential data corruption with NVMe drives

Flashes (Alerts)


Abstract

IBM has identified a potential data corruption issue in IBM Spectrum Scale Erasure Code Edition (ECE) that can occur on NVMe drives when there are network problems.

Content

In IBM Spectrum Scale Erasure Code Edition (ECE) V5.0.3.1 and later, errors such as RDMA network failures can cause a data write to fail. When a data write fails, the Disk Hospital investigates the cause of the failure. As part of that investigation, it tries to read and write a few sectors of data at the same location. If the drive under investigation is an NVMe drive, bad data may be written back, which can cause undetected data loss or data corruption.
When this occurs, error messages are logged in mmfs.log, for example:
[E] Error 214 code 154 processing fast-write log for LG LG007 of RG rg1.

[E] Beginning to resign log group LG007 in recovery group rg1 due to "recovery failure", caller err 214 when "recovering log group worker"
Special attention should be given to messages starting with "[E] Error validating", for example:
[E] Error validating trailer checksum in vdisk RG001LG009VS002 vtrack 7687050 data segment 22 pdisk n007p003 psector 19311753056 vsector 62972315005
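
The check for these messages can be scripted. The following is a minimal sketch in Python, not an IBM-provided tool: the function name scan_log_lines and the category labels are hypothetical, and the patterns are taken only from the example messages in this alert, so the wording in your log may differ.

```python
import re
from typing import Iterable, List, Tuple

# Patterns derived from the example messages quoted in this alert.
# Assumption: other releases may word these messages differently.
PATTERNS = [
    ("validation", re.compile(r"\[E\] Error validating")),
    ("fast-write-log", re.compile(r"\[E\] Error \d+ code \d+ processing fast-write log")),
    ("log-group-resign", re.compile(r"\[E\] Beginning to resign log group")),
]


def scan_log_lines(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, label, text) for every line matching a pattern."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS:
            if pattern.search(line):
                hits.append((lineno, label, line.rstrip("\n")))
                break  # report each line once, under the first matching label
    return hits


# Small in-memory sample built from the messages shown above.
sample = [
    "[E] Error 214 code 154 processing fast-write log for LG LG007 of RG rg1.",
    "an unrelated informational line",
    "[E] Error validating trailer checksum in vdisk RG001LG009VS002 vtrack 7687050",
]
for lineno, label, text in scan_log_lines(sample):
    print(f"{lineno}: [{label}] {text}")
```

To scan a real log, pass an open file handle instead of the sample list. The location of mmfs.log on your nodes is something you supply. Lines labeled "validation" are the ones this alert asks you to pay special attention to.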
Users Affected:
This issue may affect customers running Spectrum Scale ECE V5.0.3.1 or later.
Recommendations:
Affected customers should apply Spectrum Scale ECE V5.0.5 or later. The fix is available from Fix Central.
If you cannot apply the above release level, contact IBM Services to obtain and apply an efix for your code level:
    • For IBM Spectrum Scale V5.0.3.1 through V5.0.4.4, reference APAR IJ24518
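
The version ranges above can be expressed as a simple comparison. This is an illustrative sketch only: the helper names parse_version and classify are hypothetical, the returned strings are my own wording, and the version string must be supplied by you rather than read from the product.

```python
from typing import Tuple

# Version boundaries as stated in this alert.
FIRST_AFFECTED = (5, 0, 3, 1)  # V5.0.3.1
EFIX_LAST = (5, 0, 4, 4)       # last level covered by APAR IJ24518
FIRST_FIXED = (5, 0, 5, 0)     # V5.0.5


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn 'V5.0.4.2' or '5.0.5' into a 4-part tuple, padding with zeros."""
    parts = [int(p) for p in text.strip().lstrip("Vv").split(".")]
    if len(parts) > 4:
        raise ValueError(f"unexpected version format: {text!r}")
    parts += [0] * (4 - len(parts))
    return tuple(parts)


def classify(version: str) -> str:
    """Map a version string onto the ranges described in this alert."""
    v = parse_version(version)
    if v < FIRST_AFFECTED:
        return "not listed as affected"
    if v >= FIRST_FIXED:
        return "contains the fix"
    if v <= EFIX_LAST:
        return "affected: upgrade to V5.0.5 or later, or request an efix (APAR IJ24518)"
    return "affected: upgrade to V5.0.5 or later"


for level in ("5.0.3.0", "V5.0.4.2", "5.0.5"):
    print(level, "->", classify(level))
```

Tuples of integers are compared element by element, which avoids the pitfalls of comparing version strings lexically.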


Document Information

Modified date:
26 May 2020

UID

ibm16210439