IBM Support

An issue with IBM Flash Core Modules in IBM Storage Scale System 6000 could lead to detected data loss

Notification


Risk classification

HIPER (High Impact and/or Pervasive)

Risk categories

Data Loss

Abstract

IBM has identified an issue in IBM Storage Scale System 6.2.0.0 through 6.2.2.0 code, where detected data loss can occur in an IBM Storage Scale System 6000 that uses IBM Flash Core Modules (FCM). 

Description

During node initialization, Linux may reset the NVMe controllers and perform device discovery. As part of discovery, Linux sends a wide range of NVMe Identify admin commands to find attached devices. Discovery typically runs when Linux starts or when an unresponsive device is recovered.

FCMs do not support all NVMe Identify commands. When an FCM device rejects an unsupported Identify command, it can expose an issue that may result in a misread or miswrite as follows:

  • While an unsupported command received on one port is being rejected, read or write commands received on the other port of the same drive may return incorrect data from the drive (transient misread) or write incorrect data to the media (miswrite).
  • Read or write operations from one canister may be exposed to this issue while the peer canister is initializing (booting) or while a device is being recovered.

GPFS Native RAID (GNR) implements a strong data validation strategy and automatically recovers from most misread events:

  • Both data and its associated metadata have a checksum stored with the data in the FCM media during a write operation.
  • The checksum is validated during read operations and background scrub operations.
  • When a misread occurs, GNR detects it during checksum validation and corrects it by recreating the data and rewriting it. GNR logs all checksum error events.
  • On very rare occasions, GNR might be unable to recreate the data (and therefore correct the media) because checksum errors from multiple drives exceed the redundancy (for example, three errors in the same 8+2P RAID stripe). In that case, GNR returns an error to the host request, and the requested data might be permanently lost.
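
The detection and recovery decision described above can be illustrated with a small sketch. It is not GNR's implementation: CRC32 stands in for whatever checksum GNR actually uses, the names Strip, write_strip, bad_strips, and read_outcome are hypothetical, and the actual rebuild from parity is omitted. It only shows why two checksum errors in an 8+2P stripe are recoverable while three are not.

```python
import zlib
from dataclasses import dataclass
from typing import List

PARITY_STRIPS = 2  # 8+2P: a stripe tolerates errors on up to two strips


@dataclass
class Strip:
    data: bytes
    checksum: int  # stored alongside the data at write time


def write_strip(data: bytes) -> Strip:
    # Assumption: CRC32 is a stand-in for the real checksum algorithm.
    return Strip(data=data, checksum=zlib.crc32(data))


def bad_strips(stripe: List[Strip]) -> List[int]:
    # Recompute each checksum, as a read or background scrub would,
    # and report the indexes of strips whose data no longer matches.
    return [i for i, s in enumerate(stripe) if zlib.crc32(s.data) != s.checksum]


def read_outcome(stripe: List[Strip]) -> str:
    bad = bad_strips(stripe)
    if not bad:
        return "ok"
    if len(bad) <= PARITY_STRIPS:
        # Enough redundancy remains to recreate and rewrite the data.
        return "recoverable"
    # Errors exceed the redundancy; an error goes back to the host.
    return "unrecoverable"
```

With a ten-strip stripe, corrupting two strips yields "recoverable" and corrupting a third yields "unrecoverable", which matches the 8+2P example in the text.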

Users Affected: 
This issue may affect clients that use all of the following: 

  • IBM Storage Scale System 6000 with FCM 
  • IBM Storage Scale System 6.2.0.0 through 6.2.2.0 
  • FCM firmware at version 4_1_10 or earlier
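
As a rough illustration, the three conditions above can be combined into a single check. This sketch assumes that release strings are four dot-separated integers and that FCM firmware levels are underscore-separated integers that compare numerically; the function names are hypothetical. The numeric parse matters because a plain string comparison would rank "4_1_8" above "4_1_10".

```python
def parse_level(level: str, sep: str) -> tuple:
    # "6.2.1.0" -> (6, 2, 1, 0); "4_1_10" -> (4, 1, 10)
    return tuple(int(part) for part in level.split(sep))


def is_affected(has_fcm: bool, scale_release: str, fcm_firmware: str) -> bool:
    release = parse_level(scale_release, ".")
    firmware = parse_level(fcm_firmware, "_")
    # All three conditions must hold for a system to be exposed.
    return (
        has_fcm
        and (6, 2, 0, 0) <= release <= (6, 2, 2, 0)
        and firmware <= (4, 1, 10)
    )
```

For example, is_affected(True, "6.2.1.0", "4_1_8") is True, while the same system with FCM firmware 4_1_11 is not flagged.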

Problem Determination: 
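Check whether your drives are running an affected FCM firmware level. In the following example output, the drives are at firmware level 4_1_8, which is affected: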

[root~]# mmlsfirmware --type drive
                     enclosure      firmware  available
type   product id    serial number  level     firmware   location
----   ----------    -------------  --------  --------   --------
drive  1014101406e4  78L000F        4_1_8     4_1_11     c145f14ess6k03a Enclosure 78L000F Drive 10
drive  1014101406e4  78L000F        4_1_8     4_1_11     c145f14ess6k03a Enclosure 78L000F Drive 12
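
On systems with many drives, the output can be filtered with a short script. The following is a minimal sketch that assumes the column layout shown in the example above (type, product id, enclosure serial number, firmware level, available firmware, location) and numeric underscore-separated firmware levels. The function name drives_at_or_below is hypothetical; adjust the parsing if your output differs.

```python
def drives_at_or_below(output: str, threshold=(4, 1, 10)):
    """Return (location, firmware level) for drives at or below threshold."""
    flagged = []
    for line in output.splitlines():
        # Split into at most six fields; the location keeps its spaces.
        fields = line.split(None, 5)
        if len(fields) < 6 or fields[0] != "drive":
            continue  # skip the prompt, header, and separator lines
        level, location = fields[3], fields[5].strip()
        if tuple(int(part) for part in level.split("_")) <= threshold:
            flagged.append((location, level))
    return flagged


# Sample text copied from the example output above.
SAMPLE = """\
type product id serial number level firmware location
---- ---------- ------------- -------- -------- --------
drive 1014101406e4 78L000F 4_1_8 4_1_11 c145f14ess6k03a Enclosure 78L000F Drive 10
drive 1014101406e4 78L000F 4_1_8 4_1_11 c145f14ess6k03a Enclosure 78L000F Drive 12
"""

for location, level in drives_at_or_below(SAMPLE):
    print(f"{location}: firmware {level} is affected")
```

The script reads saved command output rather than invoking mmlsfirmware itself, so it can be run away from the storage nodes.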

Reference ID

Internal reference: D.343530 

Date first published

26 March 2025


Document Information

Modified date:
26 March 2025

UID

ibm17214673