An issue with IBM Flash Core Modules in IBM Storage Scale System 6000 could lead to detected data loss

Notification

Risk classification

HIPER (High Impact and/or Pervasive)

Risk categories

Data Loss

Abstract

IBM has identified an issue in IBM Storage Scale System 6.2.0.0 through 6.2.2.0 code, where detected data loss can occur in an IBM Storage Scale System 6000 that uses IBM Flash Core Modules (FCM).

Description

During node initialization, Linux may reset the NVMe controllers and perform device discovery. As part of the discovery, Linux sends a wide range of NVMe Identify admin commands to find attached devices. These discovery activities are typically performed when Linux starts or when an unresponsive device is recovered.

FCMs do not support all NVMe Identify commands. When an FCM device rejects an unsupported Identify command, it can expose an issue that may result in a misread or miswrite as follows:

While the FCM is rejecting an unsupported command received on one port, read or write commands received through the other port of the same drive may return incorrect data from the drive (transient misread) or write incorrect data to the media (miswrite).
A read or write operation from one canister may be subject to this exposure while the peer canister is initializing (booting) or while a device is being recovered.

GNR implements checksum-based data validation and automatically recovers from most misread events:

Both data and its associated metadata have a checksum stored with the data in the FCM media during a write operation.
The checksum is validated during read operations and background scrub operations.
When a misread happens, GNR detects it during checksum validation and corrects the incorrect data by recreating the data and rewriting it. GNR logs all incorrect checksum events.
On very rare occasions, GNR might not be able to recreate the data (and therefore correct the media) because checksum errors observed from multiple drives exceed the redundancy (that is, three errors in the same 8+2P RAID stripe). In that case, GNR returns an error to the host request, and the requested data might be permanently lost.
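
The detection and redundancy limit described above can be sketched in a few lines of Python. This is an illustration only, not GNR's implementation: CRC32 stands in for GNR's checksum (the bulletin does not name the algorithm), the names `strip_is_valid` and `classify_stripe` are hypothetical, and no data reconstruction is performed. The sketch only shows why two bad strips in an 8+2P stripe are recoverable while three are not.

```python
import zlib

DATA_STRIPS = 8
PARITY_STRIPS = 2  # 8+2P: redundancy covers at most two bad strips per stripe


def strip_is_valid(payload: bytes, stored_checksum: int) -> bool:
    """Return True if the payload still matches the checksum stored with it.

    CRC32 is an assumption used for illustration; GNR's checksum differs.
    """
    return zlib.crc32(payload) == stored_checksum


def classify_stripe(strips):
    """Classify one stripe given a list of (payload, stored_checksum) pairs.

    Returns (state, bad_indexes), where state is "clean", "recoverable",
    or "unrecoverable" depending on how many strips fail validation.
    """
    bad = [i for i, (payload, checksum) in enumerate(strips)
           if not strip_is_valid(payload, checksum)]
    if not bad:
        return "clean", bad
    if len(bad) <= PARITY_STRIPS:
        return "recoverable", bad
    return "unrecoverable", bad
```

With ten strips (eight data plus two parity), one or two checksum failures classify as "recoverable", and a third failure in the same stripe classifies as "unrecoverable", which corresponds to the case where an error is returned to the host.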

Users Affected:
This issue may affect clients that use all of the following:

IBM Storage Scale System 6000 with FCM
IBM Storage Scale System 6.2.0.0 through 6.2.2.0
FCM firmware version 4_1_10 or lower
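
As a minimal sketch, the three conditions above can be combined into a single check. The function and constant names are hypothetical, and the release and firmware strings are assumed to follow the dotted (`6.2.1.0`) and underscore-separated (`4_1_8`) forms used in this notification. The check encodes only the listed criteria; it does not replace the steps under Recommended Action.

```python
AFFECTED_RELEASE_MIN = (6, 2, 0, 0)
AFFECTED_RELEASE_MAX = (6, 2, 2, 0)
LAST_AFFECTED_FCM_FW = (4, 1, 10)


def parse_version(text: str, sep: str) -> tuple:
    """Split a version string on sep and convert each part to an integer."""
    return tuple(int(part) for part in text.split(sep))


def is_exposed(has_fcm: bool, release: str, fcm_firmware: str) -> bool:
    """Return True only when all three conditions from the notification hold."""
    release_in_range = (
        AFFECTED_RELEASE_MIN <= parse_version(release, ".") <= AFFECTED_RELEASE_MAX
    )
    firmware_in_range = parse_version(fcm_firmware, "_") <= LAST_AFFECTED_FCM_FW
    return has_fcm and release_in_range and firmware_in_range
```

Comparing integer tuples rather than raw strings matters here: as plain strings, `"4_1_8"` sorts after `"4_1_11"`, which would misclassify the firmware level.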

Problem Determination:
Verify that you are running the affected FCM firmware:

[root~]# mmlsfirmware --type drive
                     enclosure      firmware  available
type   product id    serial number  level     firmware   location
----   ----------    -------------  --------  --------   --------
drive  1014101406e4  78L000F        4_1_8     4_1_11     c145f14ess6k03a Enclosure 78L000F Drive 10
drive  1014101406e4  78L000F        4_1_8     4_1_11     c145f14ess6k03a Enclosure 78L000F Drive 12
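
For systems with many drives, the listing can be scanned programmatically. The sketch below is a hedged example that parses text in the layout of the sample output above and reports drives whose firmware level is below 4_1_11. It assumes that the firmware level is the fourth whitespace-separated field of each `drive` row and that every listed drive is an FCM; the function names are hypothetical, and the script does not invoke `mmlsfirmware` itself.

```python
SAMPLE_OUTPUT = """\
drive 1014101406e4 78L000F 4_1_8 4_1_11 c145f14ess6k03a Enclosure 78L000F Drive 10
drive 1014101406e4 78L000F 4_1_8 4_1_11 c145f14ess6k03a Enclosure 78L000F Drive 12
"""

FIXED_LEVEL = (4, 1, 11)  # first FCM firmware level that contains the fix


def parse_level(level: str) -> tuple:
    """Convert a level such as '4_1_8' into (4, 1, 8) for numeric comparison."""
    return tuple(int(part) for part in level.split("_"))


def drives_needing_update(output: str) -> list:
    """Return (location, level) for each drive row below FIXED_LEVEL.

    Header lines, separators, and rows whose level does not parse as
    underscore-separated integers are skipped.
    """
    flagged = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] != "drive":
            continue
        level = fields[3]
        try:
            parsed = parse_level(level)
        except ValueError:
            continue
        if parsed < FIXED_LEVEL:
            flagged.append((" ".join(fields[5:]), level))
    return flagged


if __name__ == "__main__":
    for location, level in drives_needing_update(SAMPLE_OUTPUT):
        print(f"{location}: firmware {level} is below 4_1_11")
```

An empty result from `drives_needing_update` after the firmware update would indicate that every parsed drive row reports 4_1_11 or greater.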

Recommended Action

IBM strongly recommends that customers who use IBM Storage Scale System 6000 with FCM upgrade to 6.2.2.1 or later:

https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Storage+Scale+System&release=6.2.2&platform=All&function=all

After the upgrade, apply the code fix for this issue by updating the firmware on the FCM drives. Run the following command on the EMS to update the FCM firmware:

mmchfirmware --type drive

Verify that the firmware was updated. The firmware level should show 4_1_11 or greater:

mmlsfirmware --type drive

Any miswrite location that is not read will be corrected by the automatic GNR background scrubber.

Reference ID

Internal reference: D.343530

Date first published

26 March 2025

