Notification
Risk classification
HIPER (High Impact and/or Pervasive)
Risk categories
Data Loss
Abstract
IBM has identified an issue in IBM Storage Scale System code levels 6.2.0.0 through 6.2.2.0 that can cause detected data loss on an IBM Storage Scale System 6000 that uses IBM FlashCore Modules (FCM).
Description
During node initialization, Linux may reset the NVMe controllers and perform a device discovery. As part of the discovery, Linux sends a wide range of NVMe Identify admin commands to find attached devices. Discovery typically runs during Linux startup or during recovery of an unresponsive device.
FCMs do not support all NVMe Identify commands. When an FCM device rejects an unsupported Identify command, it can expose an issue that may result in a misread or miswrite as follows:
- While the drive is rejecting an unsupported command received on one port, read or write commands received through the other port of the same drive may return incorrect data from the drive (transient misread) or write incorrect data to the media (miswrite).
- Read or write operations issued from one canister while the peer canister is initializing (booting), or while a device is being recovered, may be subject to this exposure.
GNR validates data with checksums and automatically recovers from most misread events:
- Both data and its associated metadata have a checksum stored with the data in the FCM media during a write operation.
- The checksum is checked and validated during read operation or background scrub operations.
- When a misread happens, GNR detects it during checksum validation, reconstructs the correct data, and rewrites it to the media. GNR logs all checksum mismatch events.
- On very rare occasions, checksum errors from multiple drives can exceed the available redundancy (for example, three errors in the same 8+2P RAID stripe). In that case GNR cannot reconstruct the data or correct the media, and it returns an error to the host request. The requested data might be permanently lost.
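The detection and recovery limits described above can be illustrated with a small sketch. This is not GNR's implementation: CRC-32 from Python's `zlib` stands in for whatever checksum GNR actually uses, and the names `write_strip`, `read_is_valid`, and `stripe_recoverable` are hypothetical, chosen only for illustration.

```python
import zlib

PARITY_STRIPS = 2  # 8+2P: eight data strips plus two parity strips


def write_strip(payload: bytes) -> tuple[bytes, int]:
    """Store a payload together with its checksum (CRC-32 as a stand-in)."""
    return payload, zlib.crc32(payload)


def read_is_valid(payload: bytes, stored_checksum: int) -> bool:
    """Recompute the checksum on read and compare it with the stored value."""
    return zlib.crc32(payload) == stored_checksum


def stripe_recoverable(bad_strips: int, parity: int = PARITY_STRIPS) -> bool:
    """A stripe can be rebuilt only while bad strips do not exceed parity."""
    return bad_strips <= parity


payload, checksum = write_strip(b"user data block")
misread = b"user data blocK"  # simulated transient misread (one byte differs)

print(read_is_valid(payload, checksum))  # clean read passes validation
print(read_is_valid(misread, checksum))  # misread fails validation
print(stripe_recoverable(2))             # two bad strips: within 8+2P redundancy
print(stripe_recoverable(3))             # three bad strips: exceeds redundancy
```

The last call mirrors the three-error case in the text: with two parity strips, a third bad strip in the same stripe leaves nothing to rebuild from.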
Users Affected:
This issue may affect clients that use all of the following:
- IBM Storage Scale System 6000 with FCM
- IBM Storage Scale System 6.2.0.0 through 6.2.2.0
- FCM firmware at version 4_1_10 or lower
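The three conditions above must all hold for a system to be exposed. A minimal sketch of that check follows; `parse_version` and `is_affected` are hypothetical helper names, not part of any IBM tool. Versions are compared as integer tuples because a plain string comparison would rank 4_1_8 above 4_1_10.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '6.2.2.0' or '4_1_10' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.replace("_", ".").split("."))


def is_affected(has_fcm: bool, release: str, fcm_firmware: str) -> bool:
    """Return True only when all three listed conditions hold."""
    release_in_range = (6, 2, 0, 0) <= parse_version(release) <= (6, 2, 2, 0)
    firmware_old = parse_version(fcm_firmware) <= (4, 1, 10)
    return has_fcm and release_in_range and firmware_old


print(is_affected(True, "6.2.1.0", "4_1_8"))   # all three conditions hold
print(is_affected(True, "6.2.2.1", "4_1_8"))   # release is outside the range
print(is_affected(True, "6.2.2.0", "4_1_11"))  # firmware already at fixed level
```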
Problem Determination:
Check whether the FCM drives are running an affected firmware level:
[root~]# mmlsfirmware --type drive
                     enclosure       firmware  available
type   product id    serial number   level     firmware   location
----   ----------    -------------   --------  --------   --------
drive  1014101406e4  78L000F         4_1_8     4_1_11     c145f14ess6k03a Enclosure 78L000F Drive 10
drive  1014101406e4  78L000F         4_1_8     4_1_11     c145f14ess6k03a Enclosure 78L000F Drive 12
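On systems with many drives, the firmware level column can be scanned programmatically. The sketch below is a hypothetical helper, not an IBM-provided tool. It assumes that drive rows follow the whitespace-separated layout of the sample output above (type, product id, enclosure serial number, firmware level, available firmware, location) and that levels use the numeric `N_N_N` form; real output may differ.

```python
FIXED_LEVEL = (4, 1, 11)  # first firmware level that contains the fix


def level_tuple(level: str) -> tuple[int, ...]:
    """Convert a level such as '4_1_8' into (4, 1, 8) for numeric comparison."""
    return tuple(int(part) for part in level.split("_"))


def drives_needing_update(output: str) -> list[str]:
    """Return the location of every drive row whose level is below FIXED_LEVEL."""
    needing = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the prompt, header, and separator lines; keep only drive rows.
        if len(fields) < 6 or fields[0] != "drive":
            continue
        level = fields[3]                   # fourth column: installed level
        location = " ".join(fields[5:])     # remaining columns: location
        if level_tuple(level) < FIXED_LEVEL:
            needing.append(location)
    return needing


sample = """\
drive 1014101406e4 78L000F 4_1_8 4_1_11 c145f14ess6k03a Enclosure 78L000F Drive 10
drive 1014101406e4 78L000F 4_1_8 4_1_11 c145f14ess6k03a Enclosure 78L000F Drive 12
"""
for location in drives_needing_update(sample):
    print("needs update:", location)
```

The same function can be rerun after the firmware update; an empty result would indicate that no listed drive remains below 4_1_11.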
Recommended Action
Customers that use IBM Storage Scale System 6000 with FCM are strongly advised to upgrade to IBM Storage Scale System 6.2.2.1 or later.
After the upgrade, apply the code fix for this issue by updating the firmware on the FCM drives. Run the following command on the EMS to update the FCM firmware:
mmchfirmware --type drive
Verify that the firmware was updated. The firmware level should show 4_1_11 or later:
mmlsfirmware --type drive
Any miswritten location that has not been read will be corrected by the automatic GNR background scrubber.
Reference ID
Internal reference: D.343530
Date first published
26 March 2025
Document Information
Modified date:
26 March 2025
UID
ibm17214673